
(Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Presented: 2015/07/02


  1. Facial Landmark Detection by Deep Multi-task Learning (2015/7/2, Masahiro Suzuki)
  2. Contents
     ¤ Paper Information
     ¤ Introduction
     ¤ Related Work
     ¤ Tasks-Constrained Deep Convolutional Network
     ¤ Experiments
     ¤ Conclusion
  3. Paper Information
     Title: Facial Landmark Detection by Deep Multi-task Learning (2014)
     Authors: Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang
     ¤ The Chinese University of Hong Kong / Multimedia Laboratory
     ¤ Deep learning (CNN) + multi-task learning
     ¤ Motivation for picking this paper: I am studying lifelong learning (online multi-task learning) with deep learning
  4. Facial landmark detection
     ¤ Facial landmark detection is a fundamental component in many face analysis tasks
       ¤ facial attribute inference
       ¤ face verification
       ¤ face recognition
     ¤ It remains a formidable challenge due to partial occlusion and large head-pose variations
  5. Approach
     The authors' view:
     ¤ facial landmark detection is not a standalone problem
     ¤ its estimation can be influenced by a number of heterogeneous and subtly correlated factors
     → main task + auxiliary tasks: multi-task learning
  6. Contribution
     They propose a Tasks-Constrained Deep Convolutional Network (TCDCN)
     ¤ the first attempt to investigate how facial landmark detection can be optimized together with heterogeneous but subtly correlated tasks
     ¤ they show that:
       ¤ the representations learned from related tasks facilitate learning of the main task
       ¤ task relatedness is captured implicitly by the proposed model
       ¤ the proposed approach outperforms existing methods
     ¤ they demonstrate the effectiveness of using five-landmark estimation as a robust initialization for improving a state-of-the-art face alignment method
  7. Contents
     ¤ Paper Information
     ¤ Introduction
     ¤ Related Work
     ¤ Tasks-Constrained Deep Convolutional Network
     ¤ Experiments
     ¤ Conclusion
  8. Prior work on facial landmark detection
     Regression-based methods: obtain the landmark positions directly by regression
     ¤ estimate landmarks directly from image patches with SVR
     ¤ many prior works use random regression forests
     ¤ refinement is iterated from an initial landmark estimate, so results depend on the initialization
     Template fitting methods: fit a model of position and appearance to the image
     ¤ fit a face template to the image
     ¤ can perform face detection, facial landmark detection, and pose estimation jointly
     Cascaded CNN: method from the same laboratory as this paper
     ¤ splits the face per landmark and applies CNNs stage by stage; needs many CNNs (23 in total)
     Compared with this prior work, the distinguishing points of the paper are the use of auxiliary tasks and a raw-pixel-input CNN that runs in less time without cascading.
     References:
     Valstar, M., Martinez, B., Binefa, X., Pantic, M.: Facial point detection using boosted regression and graph models. In: CVPR, pp. 2729-2736 (2010)
     Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. PAMI 23(6), 681-685 (2001)
     Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: CVPR, pp. 3476-3483 (2013)
  9. Landmark detection by CNN
     Cascaded CNN [Sun et al. 2013]
     ¤ the face is divided into several parts in advance, a CNN estimates landmarks for each, and the outputs are averaged at the end
     ¤ reading the original paper, the CNNs appear to be applied stage by stage
     ¤ the work closest to this study (same laboratory as the authors)
     [Figure: three-level cascaded convolutional networks. The input is the face region returned by a face detector; the three networks at level 1 are F1, EN1, NM1; the level-2 networks are LE21, LE22, RE21, RE22, N21, N22, LM21, LM22, RM21, RM22, where both LE21 and LE22 predict the left eye center, and so forth]
  10. Multi-task learning
      ¤ deep learning and multi-task learning go well together
        ¤ features learned for one task can be reused by the other tasks
      ¤ conventional multi-task learning assumes every task has the same difficulty and convergence rate
        ¤ here the tasks are not equal, so that assumption cannot be used as-is
      → this work sets early stopping per task (inspired by [Caruana et al. 1997])
  11. Problem Formulation
      ¤ Conventional multi-task learning (MTL) seeks to improve the generalization of multiple related tasks by learning them jointly
      ¤ Suppose there are T tasks and the training data for the t-th task are $(\mathbf{x}_i^t, y_i^t)$, $t \in \{1, \dots, T\}$, $i \in \{1, \dots, N\}$, with features $\mathbf{x}_i^t \in \mathbb{R}^d$ and labels $y_i^t \in \mathbb{R}$
      ¤ The goal of MTL is to minimize

        $$\arg\min_{\{\mathbf{w}^t\}_{t=1}^{T}} \sum_{t=1}^{T} \sum_{i=1}^{N} \ell\big(y_i^t, f(\mathbf{x}_i^t; \mathbf{w}^t)\big) + \Phi(\mathbf{w}^t), \qquad (1)$$

      where $f(\mathbf{x}^t; \mathbf{w}^t)$ is a function of $\mathbf{x}^t$ parameterized by the weight vector $\mathbf{w}^t$, $\ell(\cdot)$ is the loss function (typically least squares for regression and hinge loss for classification), and $\Phi(\mathbf{w}^t)$ is a regularization term that penalizes the complexity of the weights
  12. Proposed Formulation
      ¤ The multi-task learning used in this work has two features:
        ¤ two different loss functions can be optimized jointly (regression and classification can be mixed)
        ¤ the features x are shared across tasks instead of being task-specific
      ¤ In contrast to conventional MTL, which maximizes the performance of all tasks, the aim is to optimize the main task r (facial landmark detection) with the assistance of an arbitrary number of related/auxiliary tasks $a \in A$, e.g. facial pose estimation and attribute inference:

        $$\arg\min_{\mathbf{W}^r, \{\mathbf{W}^a\}_{a \in A}} \sum_{i=1}^{N} \ell^r\big(y_i^r, f(\mathbf{x}_i; \mathbf{W}^r)\big) + \sum_{i=1}^{N} \sum_{a \in A} \lambda^a \ell^a\big(y_i^a, f(\mathbf{x}_i; \mathbf{W}^a)\big), \qquad (2)$$

      where $\lambda^a$ is the importance coefficient of the a-th auxiliary task. (Notation: scalars, vectors, and matrices are written as lowercase, bold lowercase, and bold capital letters, respectively)
  13. Proposed Formulation
      ¤ The main task is a regression problem and the auxiliary tasks are classification problems, so the loss functions are the squared error and the cross-entropy, respectively
      ¤ Labels: $y_i^r \in \mathbb{R}^{10}$ gives the 2D coordinates of the five landmarks (centers of the eyes, nose, corners of the mouth); $y_i^p \in \{0, 1, \dots, 4\}$ indicates five poses ($0^\circ, \pm 30^\circ, \pm 60^\circ$); $y_i^g, y_i^w, y_i^s \in \{0, 1\}$ are the binary attributes 'gender', 'wearing glasses', and 'smiling'
      ¤ Combining the two kinds of loss, the objective function becomes

        $$\arg\min_{\mathbf{W}^r, \{\mathbf{W}^a\}} \frac{1}{2}\sum_{i=1}^{N} \big\|y_i^r - f(\mathbf{x}_i; \mathbf{W}^r)\big\|^2 - \sum_{i=1}^{N}\sum_{a \in A} \lambda^a\, y_i^a \log p\big(y_i^a \mid \mathbf{x}_i; \mathbf{W}^a\big) + \sum_{t} \|\mathbf{W}\|_2^2, \qquad (3)$$

      where $f(\mathbf{x}_i; \mathbf{W}^r) = (\mathbf{W}^r)^T \mathbf{x}_i$ in the first term is a linear function; the second term uses the softmax $p(y_i = m \mid \mathbf{x}_i) = \exp\{(\mathbf{W}^a_m)^T \mathbf{x}_i\} / \sum_j \exp\{(\mathbf{W}^a_j)^T \mathbf{x}_i\}$, which models the class posterior probability ($\mathbf{W}^a_j$ is the j-th column of the matrix); and the third term penalizes large weights ($\mathbf{W} = \{\mathbf{W}^r, \{\mathbf{W}^a\}\}$)
      ¤ The shared feature space x is learned jointly with a deep convolutional network (DCN), whose structure allows multi-task learning with a shared representation: given a face image $\mathbf{x}^0$, the DCN projects it to gradually higher-level representations by learning a sequence of non-linear mappings

        $$\mathbf{x}^l = \sigma\big((\mathbf{W}^{s_l})^T \mathbf{x}^{l-1}\big), \qquad l = 1, \dots \qquad (4)$$

      where $\sigma(\cdot)$ and $\mathbf{W}^{s_l}$ are the non-linear activation function and the filters learned in layer l of the DCN; $\mathbf{x}^l$ is the representation shared between the main task r and the related tasks
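To make Eq. (3) concrete, here is a minimal NumPy sketch of the objective for a single example, assuming the shared feature x has already been produced by the DCN; the names (tcdcn_loss, lambdas, weight_decay) are mine, not the authors' code, and the explicit weight_decay coefficient is an added knob for the sketch.

```python
import numpy as np

def tcdcn_loss(x, y_r, y_aux, W_r, W_aux, lambdas, weight_decay=1e-4):
    """Eq.(3) for one example: squared error for the landmark regression
    (main task), weighted cross-entropy per auxiliary classification task,
    and an L2 penalty on all task weights.
    x: (d,) shared feature; y_r: (10,) landmark coordinates;
    y_aux: {task: class index}; W_r: (d, 10);
    W_aux: {task: (d, n_classes)}; lambdas: {task: lambda^a}."""
    # Main task: least squares on the linear regressor f(x) = (W^r)^T x
    main = 0.5 * np.sum((y_r - W_r.T @ x) ** 2)
    # Auxiliary tasks: cross-entropy under a softmax class posterior
    aux = 0.0
    for a, W_a in W_aux.items():
        z = W_a.T @ x
        p = np.exp(z - z.max())
        p /= p.sum()
        aux -= lambdas[a] * np.log(p[y_aux[a]])
    # Third term of Eq.(3): penalize large weights
    reg = weight_decay * (np.sum(W_r ** 2)
                          + sum(np.sum(W_a ** 2) for W_a in W_aux.values()))
    return main + aux + reg
```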
  14. Tasks-Constrained Deep Convolutional Network
      Overall structure: a shared DCN part (the model is common to all tasks) followed by a multi-task part (one output per task); a sketch of this structure follows below
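A rough sketch of that structure, using dense layers in place of the paper's convolution/pooling stack to keep the example short; every name is illustrative, and the real TCDCN trunk is convolutional.

```python
import numpy as np

def relu(u):
    # stand-in for the non-linearity sigma(.) of Eq.(4)
    return np.maximum(u, 0.0)

def forward(x0, shared_filters, W_r, W_aux):
    """Shared trunk followed by one head per task: a linear regressor for
    the five landmarks (main task) and a softmax per auxiliary task."""
    x = x0
    for W_s in shared_filters:      # the W^{s_l} of Eq.(4)
        x = relu(W_s.T @ x)         # x^l = sigma((W^{s_l})^T x^{l-1})
    landmarks = W_r.T @ x           # 10 coordinates (5 points)
    aux_probs = {}
    for a, W_a in W_aux.items():
        z = W_a.T @ x
        e = np.exp(z - z.max())
        aux_probs[a] = e / e.sum()  # class posterior p(y^a | x)
    return landmarks, aux_probs, x  # x is the shared representation x^l
```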
  15. Task-wise early stopping
      ¤ With multiple tasks, difficulty and convergence rates differ across tasks
        ¤ the auxiliary tasks look easier than the main task → they should converge earlier
        ¤ if multi-task training continues after an auxiliary task has already reached its optimum, that task overfits and starts to harm the main task
      → halt learning per task: task-wise early stopping
      ¤ Early in training, the TCDCN is constrained by all tasks to avoid being trapped in a bad local minimum; as training proceeds, auxiliary tasks that have reached their peak performance are no longer beneficial to the main task, so their learning is halted. This regularization differs from the weight regularization in Eq. (3), which globally prevents overfitting in each task by penalizing certain parameter configurations
      ¤ Criterion to automatically decide when to stop a task: let $E^a_{val}$ and $E^a_{tr}$ be the values of the loss function of task a on the validation and training sets; the task is stopped when its measure exceeds a threshold $\epsilon$:

        $$\frac{k \cdot \operatorname{med}_{j=t-k}^{t} E^a_{tr}(j)}{\sum_{j=t-k}^{t} E^a_{tr}(j) - k \cdot \operatorname{med}_{j=t-k}^{t} E^a_{tr}(j)} \cdot \frac{E^a_{val}(t) - \min_{j=1..t} E^a_{tr}(j)}{\lambda^a \cdot \min_{j=1..t} E^a_{tr}(j)} > \epsilon, \qquad (5)$$

      where t is the current iteration, k controls a training strip of length k, and 'med' is the median
      ¤ The first term represents the tendency of the training error: if the training error drops rapidly within a strip of length k, the term is small → don't stop, the task is still valuable; otherwise it is large and the task is more likely to be stopped
      ¤ The second term measures the generalization error relative to the training error: when the gap between them grows → stop. The importance coefficient $\lambda^a$ can be learned through gradient descent; its magnitude means a more important task tends to have a longer impact. A sketch of the criterion in code follows below
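A direct transcription of Eq. (5) as a stopping test; this is a sketch under the assumption that the per-iteration loss values are kept as plain lists, and the function and argument names are hypothetical.

```python
import numpy as np

def stop_auxiliary_task(E_tr, E_val, lam_a, k=10, eps=1.0):
    """Task-wise early stopping, Eq.(5). E_tr, E_val: training and
    validation loss histories of auxiliary task a up to iteration t;
    lam_a: importance coefficient lambda^a; k: strip length; eps: threshold."""
    t = len(E_tr)
    if t <= k:
        return False                       # not enough history yet
    strip = E_tr[t - k:]                   # last k training losses
    med = np.median(strip)
    # First factor: tendency of the training error. Small while the error
    # still drops rapidly inside the strip (keep training), large once flat.
    tendency = (k * med) / (sum(strip) - k * med + 1e-12)
    # Second factor: generalization error relative to the best training error.
    best = min(E_tr)
    generalization = (E_val[-1] - best) / (lam_a * best)
    return tendency * generalization > eps
```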
  16. Learning procedure
      ¤ The weights are learned with stochastic gradient descent: for each iteration, the weights of the tasks and the filters of the network are updated
      ¤ The weight matrix of the main task is updated by $\Delta\mathbf{W}^r = -\eta\, \partial E^r / \partial \mathbf{W}^r$, with learning rate $\eta = 0.003$ in the implementation, where

        $$\frac{\partial E^r}{\partial \mathbf{W}^r} = -\big(y_i^r - (\mathbf{W}^r)^T \mathbf{x}_i\big)\mathbf{x}_i^T$$

      ¤ The auxiliary tasks' weights are updated in a similar manner, with

        $$\frac{\partial E^a}{\partial \mathbf{W}^a} = \big(p(y_i^a \mid \mathbf{x}_i; \mathbf{W}^a) - y_i^a\big)\mathbf{x}_i$$

      ¤ For the filters in the lower layers, the gradients are computed by propagating the loss error back, following the back-propagation strategy:

        $$\varepsilon^{l-1} = (\mathbf{W}^{s_l})^T \varepsilon^{l} \circ \frac{\partial \sigma(u^{l-1})}{\partial u^{l-1}}, \qquad (6)$$

      where $\partial\sigma(u)/\partial u$ is the gradient of the activation function and $\varepsilon^l$, the error at the shared representation layer, is the integration of all tasks' derivatives:

        $$\varepsilon^{l} = (\mathbf{W}^r)^T\big[y_i^r - (\mathbf{W}^r)^T \mathbf{x}_i\big] + \sum_{a \in A}\big(p(y_i^a \mid \mathbf{x}_i; \mathbf{W}^a) - y_i^a\big)\mathbf{W}^a$$

      ¤ The gradient of a filter is then obtained as $\partial E / \partial \mathbf{W}^{s_l} = \varepsilon^l \mathbf{x}^{l-1}_{\Omega}$, where $\Omega$ represents the receptive field of the filter. A sketch of one update step follows below
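The per-example head updates written out as code; a sketch only. That the λ^a factor scales the auxiliary gradient is my reading of Eq. (3), since the quoted derivative omits it, and back-propagation through the shared filters (Eq. (6)) is not shown here.

```python
import numpy as np

def sgd_step(x, y_r, y_aux, W_r, W_aux, lambdas, eta=0.003):
    """One stochastic gradient step on the task heads
    (eta = 0.003 as in the paper's implementation)."""
    # Main task: dE^r/dW^r = -(y^r - (W^r)^T x) x^T, so add the residual
    residual = y_r - W_r.T @ x
    W_r += eta * np.outer(x, residual)
    for a, W_a in W_aux.items():
        z = W_a.T @ x
        p = np.exp(z - z.max())
        p /= p.sum()
        onehot = np.zeros_like(p)
        onehot[y_aux[a]] = 1.0
        # Auxiliary task: dE^a/dW^a = (p(y^a|x) - y^a) x, scaled by
        # lambda^a (assumed; the quoted derivative leaves it out)
        W_a -= eta * lambdas[a] * np.outer(x, p - onehot)
    return W_r, W_aux
```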
  17. Experiments
      ¤ Network structure and model training
        ¤ training data: 10,000 outdoor face images from the web
        ¤ collected without worrying much about translation, rotation, or zoom
        ¤ test data: AFLW and AFW
      ¤ Evaluation metrics (computed as sketched below)
        ¤ mean error rate: distance between the ground-truth and estimated landmarks, normalized by the inter-ocular distance
        ¤ failure rate: a detection whose error exceeds 10% counts as a failure
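The two metrics in code form; a small sketch assuming landmarks are stored as (5, 2) arrays with the two eye centers at rows 0 and 1 (that indexing is my choice for the example, not specified by the paper).

```python
import numpy as np

def mean_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    landmarks, normalized by the inter-ocular distance.
    pred, gt: (5, 2) arrays; rows 0 and 1 are the eye centers."""
    interocular = np.linalg.norm(gt[0] - gt[1])
    return np.linalg.norm(pred - gt, axis=1).mean() / interocular

def failure_rate(preds, gts, threshold=0.10):
    """Fraction of faces whose normalized error exceeds 10%."""
    errors = np.array([mean_error(p, g) for p, g in zip(preds, gts)])
    return float((errors > threshold).mean())
```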
  18. The effectiveness of learning with related tasks
      ¤ Evaluated on AFLW by comparing five model variants: one trained only on facial landmark detection (FLD), and four trained on FLD plus one of the auxiliary tasks 'pose', 'gender', 'wearing glasses', 'smiling' (plus FLD+all)
      ¤ Left: mean error per landmark; right: overall failure rate across all landmarks
      ¤ The auxiliary tasks indeed lower both the error rate and the failure rate
      ¤ Using all auxiliary tasks improves the failure rate by more than 10 points (35.62% → 25.00%)
      ¤ Pose seems to help the most
      [Figure 4: comparison of the TCDCN variants FLD, FLD+gender, FLD+glasses, FLD+smile, FLD+pose, FLD+all; mean error over different landmarks and overall failure rates of 35.62, 31.86, 32.87, 32.37, 28.76, 25.00%]
  19. FLD vs. FLD+smile
      Examines on which landmarks the 'smiling' task is effective
      ¤ (a): effective on the nose and mouth, since smiling mainly affects the lower half of the face
      ¤ (b): Pearson correlation between the final-layer landmark-detection weights and the 'smiling' weights; strong correlation with the mouth corners
      [Figure 5: the smiling attribute helps detection more on the nose and corners of the mouth than on the centers of the eyes]
  20. FLD vs. FLD+pose
      Examines the effect of the pose task
      ¤ (a): the error rate drops for every pose (left profile, left, frontal, right, right profile)
      ¤ (b): measured as accuracy improvement, every pose also gets better
      This again suggests that TCDCN implicitly learns the relationships between tasks
      [Figure 6: (a) mean error of FLD vs. FLD+pose in different poses; (b) accuracy improvement by FLD+pose in different poses]
  21. The benefits of task-wise early stopping
      ¤ (a): task-wise early stopping lowers the error substantially across landmarks (FLD+all vs. FLD+all with task-wise early stopping; 'glasses', 'gender', 'smile', and 'pose' are stopped one after another)
      ¤ (b): both the training and validation errors become smaller with early stopping, and the convergence rate improves
      [Figure 7: errors are measured in L2-norm with respect to the 10 coordinate values (normalized to [0,1]) of the 5 landmarks]
  22. Comparison with the cascaded CNN
      ¤ Trained on the same 10,000 faces as the cascaded CNN [Sun et al. 2013] and tested on AFLW, so the only difference is whether a multi-task learning approach is used
      ¤ The proposed method performs better on four of the five landmarks
      ¤ Overall accuracy is also superior to the cascaded CNN
      [Figure 8: TCDCN vs. cascaded CNN [21]; (a) mean error over different landmarks and (b) overall failure rate]
      ¤ Computational efficiency: if the time of one 2D convolution operation is τ, the total cost of an L-layer CNN is approximately $\sum_{l=1}^{L} s_l^2\, q_l\, q_{l-1}\, \tau$, where $s_l$ is the 2D size of the input feature map of layer l and $q_l$ the number of filters; the complexity is thus $O(s^2 q^2)$, directly related to the input image size and the number of filters. The cascaded CNN requires multiple CNNs in its cascaded layers (23 CNNs in its implementation), so TCDCN has much lower computational cost. A toy version of this cost model follows below
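The cost model above in code, with made-up layer sizes purely to show how the comparison works; the numbers below are illustrative and are not the real configurations of either network.

```python
def cnn_cost(map_sizes, filter_counts, in_channels=1, tau=1.0):
    """Approximate cost of an L-layer CNN: sum over layers of
    s_l^2 * q_l * q_{l-1} * tau, where s_l is the 2D size of the
    input feature map of layer l and q_l its filter count
    (q_0 = number of input channels)."""
    q = [in_channels] + list(filter_counts)
    return sum(s * s * q[l + 1] * q[l] * tau
               for l, s in enumerate(map_sizes))

# Illustrative only: one multi-task network vs. 23 small cascaded CNNs
one_tcdcn = cnn_cost([40, 18, 7], [16, 48, 64])   # hypothetical sizes
cascade = 23 * cnn_cost([15, 6], [20, 40])        # hypothetical sizes
```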
  23. Comparison with other state-of-the-art methods
      ¤ Results on AFLW: the proposed method beats all of the existing methods
      ¤ Results on AFW: same tendency as AFLW
      [Figure 9: comparison with RCPR [3], TSPM [32], CDM [27], Luxand [18], and SDM [25] on the AFLW [11] (first row) and AFW [32] (second row) datasets; left subfigures show mean errors per landmark, right subfigures the overall errors: 15.9, 13.0, 13.1, 12.4, 11.6, 8.5, 8.0% on AFLW and 14.3, 12.2, 11.1, 10.4, 9.3, 8.8, 8.2% on AFW, in the legend order TSPM, ESR, CDM, Luxand, RCPR, SDM, Ours]
  24. Comparison with other state-of-the-art methods
      Results on various kinds of images:
      ¤ row 1: wearing glasses
      ¤ row 2: pose variation
      ¤ row 3:
        ¤ columns 1-2: different lighting
        ¤ column 3: low image quality
        ¤ columns 4-5: different expressions
        ¤ columns 6-8: failure cases (red marks the wrong parts)
      [Figure 10: example detections by the proposed model on AFLW [11] and AFW [32]; the labels below each image are the tagging results for the related tasks: (0°, ±30°, ±60°) for pose, S/NS = smiling/not smiling, G/NG = with/without glasses, M/F = male/female; red rectangles indicate wrong tagging]
  25. TCDCN for robust initialization
      ¤ Owing to its accuracy and efficiency, TCDCN can also be used to generate a good initialization for a state-of-the-art method
      ¤ The existing method RCPR [3] is compared with and without TCDCN-based initialization: instead of drawing training samples randomly as initialization, RCPR is initialized by first applying TCDCN
      ¤ (a): relative improvement per landmark (relative improvement = reduced error / original error)
      ¤ (b): visualization of the improvement (upper row: plain RCPR; lower row: RCPR initialized by TCDCN)
      [Figure 11: initialization with the five-landmark estimation for RCPR [3] on the COFW dataset [3]]
  26. Conclusion
      ¤ Showed that jointly learning heterogeneous but subtly correlated tasks yields more robust landmark detection
      ¤ TCDCN back-propagates the errors of the related tasks into the deep hidden layers to learn a shared representation relevant to the main task
      ¤ Task-wise early stopping is critical for securing the convergence of the model
      ¤ Multi-task learning produced a model that is quite robust to varying face conditions
      ¤ Future work: applying the approach to denser landmark detection, and applying deep multi-task learning to other vision problems
  27. Impressions
      ¤ Good to learn how multi-task learning is done with CNNs
        ¤ I did not know this method (is it new?)
      ¤ The model is CNN + linear classifiers, but CNN + CNN also seems promising
      ¤ Personally, the way the paper is written looks like a useful reference
        ¤ it proposes many new techniques and runs many experiments, similar to the paper I am writing now
