
Learning loss for active learning

The performance of deep neural networks improves with more annotated data. The problem is that the budget for annotation is limited. One solution to this is active learning, where a model asks a human to annotate the data it perceives as uncertain. A variety of recent methods have been proposed to apply active learning to deep networks, but most of them are either designed specifically for their target tasks or computationally inefficient for large networks. In this paper, we propose a novel active learning method that is simple, task-agnostic, and efficient with deep networks. We attach a small parametric module, named the "loss prediction module," to a target network, and learn it to predict the target losses of unlabeled inputs. This module can then suggest data for which the target model is likely to produce a wrong prediction. The method is task-agnostic because networks are learned from a single loss regardless of the target task. We rigorously validate our method on image classification, object detection, and human pose estimation with recent network architectures. The results demonstrate that our method consistently outperforms previous methods across the tasks.



  1. Learning Loss for Active Learning. Donggeun Yoo (Lunit), In So Kweon (KAIST). CVPR 2019, oral presentation.
  2. Introduction • Data is very important for deep learning • It is unquestionable that more data still improves network performance [Mahajan et al., ECCV'18] (tens of millions to a billion images)
  3. Introduction • Problem: limited budget for annotation (figure: annotation cost rises with label complexity, from a cheap class label "Horse=1" at $ to richer annotations at $$ and $$$)
  4. Introduction • Problem: limited budget for annotation • Disease-level annotations for medical images are super-expensive (figure: "Horse=1" costs $, a disease-level annotation $$$$$)
  5. Active Learning (figure: a model is trained on the labeled set)
  6. Active Learning (figure: the trained model runs inference on the unlabeled pool)
  7. Active Learning (figure: if a prediction is uncertain, the data point is sent for labeling)
  8. Active Learning (figure: the newly labeled data joins the labeled set for training)
  9. Active Learning (figure: the inference, labeling, and training cycle repeats)
  10. Active Learning • The key of active learning is how to measure the uncertainty (a minimal sketch of the cycle follows)
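
To make the cycle concrete, here is a minimal Python sketch of the loop in slides 5-10. The helpers `train`, `uncertainty`, and `human_annotate` are hypothetical placeholders for a training routine, an acquisition score, and an annotation interface; they are not from the paper.

```python
# Minimal sketch of the active learning cycle (slides 5-10).
# `train`, `uncertainty`, and `human_annotate` are hypothetical placeholders.
def active_learning(labeled, unlabeled, cycles, budget):
    model = train(labeled)
    for _ in range(cycles):
        # Inference on the unlabeled pool: score each point's uncertainty.
        scored = sorted(unlabeled, key=lambda x: uncertainty(model, x),
                        reverse=True)
        # Ask the human oracle to label the most uncertain points,
        # within the per-cycle annotation budget.
        queried = scored[:budget]
        labeled += [(x, human_annotate(x)) for x in queried]
        unlabeled = scored[budget:]
        # Retrain on the enlarged labeled set.
        model = train(labeled)
    return model
```
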
  11. Active Learning: Limitations
      • Heuristic approach: highest entropy [Joshi et al., CVPR'09]; distance to decision boundaries [Tong & Koller, JMLR'01]. (−) Task-specific design
      • Ensemble approach [Freund et al., ML'97], [Beluch et al., CVPR'18]. (−) Does not scale to large CNNs and data
      • Bayesian approach: expected error [Roy & McCallum, ICML'01] / expected model [Kapoor et al., ICCV'07]; Bayesian inference by dropout [Gal & Ghahramani, ICML'17]. (−) Does not scale to large data and CNNs [Sener & Savarese, ICLR'18]
      • Distribution approach: density-based [Liu & Ferrari, ICCV'17], diversity-based [Sener & Savarese, ICLR'18]. (−) Task-specific design
  12. *Entropy • An information-theoretic measure of the amount of information needed to "encode" a distribution • In active learning: a dense prediction (0.33, 0.33, 0.33) has maximum entropy; a sparse prediction (1.00, 0.00, 0.00) has minimum entropy
  13. *Entropy (continued) • (+) Very simple but works well (also in deep networks) • (−) Specific to the classification problem
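
For reference, a short PyTorch sketch of the entropy measure described above; the function name is just illustrative.

```python
import torch
import torch.nn.functional as F

def entropy_scores(logits: torch.Tensor) -> torch.Tensor:
    """Per-example Shannon entropy of the softmax prediction.

    A dense prediction like (0.33, 0.33, 0.33) yields the maximum,
    a sparse one like (1.0, 0.0, 0.0) the minimum, as on the slide.
    """
    log_p = F.log_softmax(logits, dim=1)  # numerically stable log-probs
    return -(log_p.exp() * log_p).sum(dim=1)

# Example: a confident and a maximally uncertain 3-class prediction.
logits = torch.tensor([[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(entropy_scores(logits))  # ≈ [0.001, 1.0986]; ln 3 ≈ 1.0986 is the max
```
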
  14. Active Learning: Limitations (recap of slide 11)
  15. *Bayesian Inference • Training: a dropout layer is inserted after every convolution layer • Inference: N feed-forwards → N predictions; uncertainty = variance between the predictions
  16. *Bayesian Inference (continued) • Training: dropout after every convolution layer (−) super slow convergence → impractical for current deep nets • Inference: N feed-forwards → N predictions; uncertainty = variance between the predictions (−) computationally expensive
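
A hedged PyTorch sketch of Monte Carlo dropout inference, one common reading of [Gal & Ghahramani, ICML'17]; reducing the class-wise variance by summation is a choice for illustration, not prescribed by the slide.

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model: torch.nn.Module, x: torch.Tensor,
                           n: int = 25) -> torch.Tensor:
    """N stochastic forward passes with dropout kept active;
    uncertainty = variance between the N predictions (hence the
    roughly N-fold inference cost the slide complains about)."""
    # model.train() keeps dropout stochastic; in practice you would
    # switch only the dropout layers, leaving e.g. batch norm in eval mode.
    model.train()
    preds = torch.stack([model(x).softmax(dim=1) for _ in range(n)])
    model.eval()
    return preds.var(dim=0).sum(dim=1)  # per-example uncertainty score
```
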
  17. Active Learning: Limitations (recap of slide 11)
  18. *Diversity: Core-set (figure: distribution of the unlabeled pool in feature space)
  19. *Diversity: Core-set (figure: a selected subset $\{x_s\}$ is a $\delta$-cover of the pool $\{x\}$)
  20. *Diversity: Core-set • Optimization problem: choose the subset that minimizes the cover radius, $\min_{\{x_s\}} \delta$ s.t. $\{x_s\}$ is a $\delta$-cover of $\{x\}$
  21. *Diversity: Core-set • (+) Can be task-agnostic, as it only depends on the feature space • (−) Does not consider "hard" examples near the decision boundaries • (−) Expensive optimization for a large pool (see the greedy sketch below)
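
Core-set selection is usually approximated greedily; below is a hedged PyTorch sketch of the standard greedy k-center step in the spirit of [Sener & Savarese, ICLR'18]. It illustrates the optimization discussed above and is not this paper's method.

```python
import torch

def greedy_k_center(features: torch.Tensor, labeled, k: int):
    """Greedy k-center: repeatedly add the pool point farthest from the
    current centers, shrinking the cover radius delta. This is the usual
    2-approximation; solving the problem exactly is expensive."""
    # Distance from every point to its nearest already-labeled point.
    dists = torch.cdist(features, features[labeled]).min(dim=1).values
    chosen = []
    for _ in range(k):
        i = int(dists.argmax())  # farthest point becomes a new center
        chosen.append(i)
        d_new = torch.cdist(features, features[i].unsqueeze(0)).squeeze(1)
        dists = torch.minimum(dists, d_new)
    return chosen
```
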
  22. Active Learning: Limitations
      • Heuristic approach: highest entropy [Joshi et al., CVPR'09]; distance to decision boundaries [Tong & Koller, JMLR'01]. (−) Task-specific design
      • Ensemble approach [Freund et al., ML'97], [Beluch et al., CVPR'18]. (−) Does not scale to large CNNs and data
      • Bayesian approach: expected error [Roy & McCallum, ICML'01] / expected model [Kapoor et al., ICCV'07]; Bayesian inference by dropout [Gal & Ghahramani, ICML'17]. (−) Does not scale to large CNNs and data [Sener & Savarese, ICLR'18]
      • Distribution approach: density-based [Liu & Ferrari, ICCV'17], diversity-based [Sener & Savarese, ICLR'18]. (−) Does not consider hard examples
  23. Active Learning: Our approach • Active learning by learning loss • Attach a "loss prediction module" to a target network • Learn the module to predict the loss (figure: losses are predicted over the unlabeled pool; human oracles annotate the top-K data points, which join the labeled training set; a sketch of this step follows)
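
A hedged sketch of the acquisition step drawn on this slide: score the pool with the loss prediction module and hand the top-K to the oracle. The two-output `model` interface (prediction plus mid-level features) is an assumption for illustration.

```python
import torch

@torch.no_grad()
def query_top_k(model, loss_module, pool_loader, k: int) -> torch.Tensor:
    """Rank the unlabeled pool by predicted loss and return the indices
    of the top-K candidates for human annotation.

    Assumes `model(x)` returns (prediction, mid_features) and
    `loss_module(mid_features)` returns one predicted loss per example.
    """
    model.eval(); loss_module.eval()
    scores = []
    for x in pool_loader:
        _, feats = model(x)
        scores.append(loss_module(feats).view(-1))
    return torch.cat(scores).topk(k).indices  # positions in pool order
```
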
  24. Active Learning: Our approach • Requirements • Task-agnostic method • Not heuristic, but learning-based • Scalable to state-of-the-art networks and large data
  25. Active Learning by Learning Loss (figure: the target model maps an input to a target prediction; an attached loss prediction module maps the model's features to a loss prediction; the target GT gives the target loss, and comparing it with the loss prediction gives the loss-prediction loss)
  26. Active Learning by Learning Loss (same figure: the target loss and the loss-prediction loss are trained jointly, i.e., multi-task learning)
  27. Active Learning by Learning Loss • (+) Applicable to any network, any data, and any task • (+) Nearly zero added cost
  28. Active Learning by Learning Loss (figure notation: input $x$, target prediction $\hat{y}$, target GT $y$, predicted loss $\hat{l}$, target loss $l$, loss-prediction loss $L_{\text{loss}}(\hat{l}, l)$)
  29. Active Learning by Learning Loss • The loss for loss prediction, $L_{\text{loss}}(\hat{l}, l)$ • Mean squared error? $L_{\text{loss}}(\hat{l}, l) = (\hat{l} - l)^2$
  30. Active Learning by Learning Loss • Mean squared error? → the target task loss $l$ shrinks as training progresses, so its scale keeps changing
  31. Active Learning by Learning Loss • To ignore the scale changes of $l$, we use a ranking loss
  32. Active Learning by Learning Loss • Over a pair of predicted losses $(\hat{l}_i, \hat{l}_j)$ and a pair of real losses $(l_i, l_j)$, the ranking loss is $L_{\text{loss}}(\hat{l}_i, \hat{l}_j, l_i, l_j) = \max\big(0, -\mathbb{1}(l_i, l_j) \cdot (\hat{l}_i - \hat{l}_j) + \xi\big)$, where $\mathbb{1}(l_i, l_j) = +1$ if $l_i > l_j$ and $-1$ otherwise, and $\xi$ is the margin (= 1)
  33. Active Learning by Learning Loss • Given a mini-batch $B$, the total loss is $\frac{1}{|B|} \sum_{(x, y) \in B} L_{\text{task}}(\hat{y}, y) + \lambda \cdot \frac{1}{|B|} \sum_{(x_i, y_i, x_j, y_j) \in B} L_{\text{loss}}(\hat{l}_i, \hat{l}_j, l_i, l_j)$, where $l_i = L_{\text{task}}(\hat{y}_i, y_i)$: the first term trains the target task, the second trains loss prediction over pairs $(i, j)$ within the mini-batch
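
A hedged PyTorch sketch of the pair-based objective above. Splitting the mini-batch into first and second halves to form pairs is one common convention, not something the slide dictates.

```python
import torch

def loss_prediction_loss(pred_l: torch.Tensor, true_l: torch.Tensor,
                         margin: float = 1.0) -> torch.Tensor:
    """Pairwise ranking loss L_loss from slide 32. `true_l` holds the
    per-example target-task losses; they are detached so this term does
    not push gradients into the target task through l."""
    half = pred_l.size(0) // 2
    li, lj = pred_l[:half], pred_l[half:2 * half]
    ti, tj = true_l[:half].detach(), true_l[half:2 * half].detach()
    # The indicator 1(l_i, l_j): +1 if l_i > l_j, else -1.
    sign = torch.where(ti > tj, torch.ones_like(ti), -torch.ones_like(ti))
    # max(0, -1(l_i, l_j) * (lhat_i - lhat_j) + margin), averaged over pairs.
    return torch.clamp(margin - sign * (li - lj), min=0).mean()

# Total objective of slide 33, with lambda weighting the second term:
# total = task_losses.mean() + lam * loss_prediction_loss(pred_l, task_losses)
```
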
  34. Active Learning by Learning Loss • MSE loss vs. ranking loss (plot: loss-prediction performance of the MSE variant, ResNet-18 on CIFAR-10)
  35. Active Learning by Learning Loss • MSE loss vs. ranking loss (plot adds the ranking variant, ResNet-18 on CIFAR-10)
  36. Active Learning by Learning Loss • Loss prediction module (figure: several mid-blocks of the target model feed the module; the out-block gives the target prediction, while the tapped features are concatenated and passed through an FC layer to the loss prediction)
  37. Active Learning by Learning Loss • Loss prediction module (same figure: the tapped, convolved features have already passed through enough convolutions)
  38. Active Learning by Learning Loss • Loss prediction module (same figure: the loss-prediction loss also backpropagates into those convolutions)
  39. Active Learning by Learning Loss • Loss prediction module: enough convolutions • The convolutions are learned from the loss-prediction loss as well as the target loss • The receptive field is already sufficiently large
  40. Active Learning by Learning Loss • Loss prediction module: enough convolutions → no need for more convolutions; we just focus on merging the multiple features
  41. Active Learning by Learning Loss • Loss prediction module (figure: each mid-block feature goes through GAP → FC → ReLU, the branch outputs are concatenated, and a final FC gives the loss prediction) • (+) Very efficient, as GAP reduces the feature dimension
  42. Active Learning by Learning Loss • Loss prediction module (figure: an alternative that adds a Conv → BN → ReLU layer before each GAP → FC → ReLU branch)
  43. Active Learning by Learning Loss • Loss prediction module: more convolutions vs. just FC (plot: ResNet-18 on CIFAR-10)
  44. Active Learning by Learning Loss • Loss prediction module (figure: the final design: GAP → FC → ReLU per mid-block, concatenation, then FC to the loss prediction; a sketch follows)
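
For reference, a hedged PyTorch sketch of the module as drawn above. The default channel list and the 128-d branch width follow the ResNet-18 figure on slide 48; other settings are illustrative.

```python
import torch
import torch.nn as nn

class LossPredictionModule(nn.Module):
    """GAP -> FC -> ReLU per mid-block feature map, concatenate,
    then one FC to a scalar predicted loss (slide 44)."""
    def __init__(self, in_channels=(64, 128, 256, 512), width=128):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),  # global average pooling
                nn.Flatten(),
                nn.Linear(c, width),
                nn.ReLU(inplace=True),
            ) for c in in_channels
        )
        self.fc = nn.Linear(width * len(in_channels), 1)

    def forward(self, feats):
        # feats: list of mid-block feature maps, e.g. for ResNet-18 on
        # CIFAR-10: 64x32x32, 128x16x16, 256x8x8, 512x4x4 (slide 48).
        h = torch.cat([b(f) for b, f in zip(self.branches, feats)], dim=1)
        return self.fc(h)  # one predicted loss per example
```
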
  45. Experiments (1) • To validate "task-agnostic" + "state-of-the-art architectures": image classification (pure classification) on CIFAR-10 with ResNet-18 [He et al., CVPR'16]
  46. Experiments (1) • Adds object detection (classification + regression) on PASCAL VOC 2007+2012 with SSD [Liu et al., ECCV'16]
  47. Experiments (1) • Adds human pose estimation (regression) on MPII with Stacked Hourglass Networks [Newell et al., ECCV'16]
  48. Results • Image classification on CIFAR-10 (figure: ResNet-18 [He et al., CVPR'16] with the loss prediction module; mid-block features of 64×32×32, 128×16×16, 256×8×8, and 512×4×4 each pass through GAP → FC → ReLU to 128-d, and the concatenated 512-d vector feeds the final FC)
  49. Results • Image classification on CIFAR-10 (plot: loss-prediction performance)
  50. Results • Image classification on CIFAR-10, mean of 5 trials (plot compares entropy [Joshi, CVPR'09], core-set [Sener et al., ICLR'18], and ours)
  51. Results • Image classification on CIFAR-10, mean of 5 trials (the plot highlights a +3.37% gain for our method)
  52. Results • Data selection vs. architecture: data selection by active learning gives +3.37%, while the gap between DenseNet-121 [Huang et al.] and ResNet-18 is +2.02%
  53. Results • Object detection (figure: SSD (ImageNet pre-trained) [Liu et al., ECCV'16] with the loss prediction module; six feature maps of 512×38×38, 1024×19×19, 512×10×10, 256×5×5, 256×3×3, and 256×1×1 each pass through GAP → FC → ReLU to 128-d, and the concatenated 768-d vector feeds the final FC)
  54. Results • Object detection on PASCAL VOC 07+12 (plot: loss-prediction performance)
  55. Results • Object detection on PASCAL VOC 07+12, mean of 3 trials (plot compares entropy [Joshi, CVPR'09], core-set [Sener et al., ICLR'18], and ours)
  56. Results • Object detection on PASCAL VOC 07+12, mean of 3 trials (the plot highlights a +2.21% gain for our method)
  57. Results • Data selection vs. architecture: data selection by active learning gives +2.21%, while the gap between YOLOv2 [Redmon et al.] and SSD is +1.80%
  58. Results • Human pose estimation on the MPII dataset (figure: Stacked Hourglass Network [Newell et al., ECCV'16] with the loss prediction module; 256×64×64 feature maps from an hourglass pass through GAP → FC → ReLU branches to 128-d each and are concatenated for the final FC; the figure lists a 1024-d concatenated feature)
  59. Results • Human pose estimation on the MPII dataset (plot: loss-prediction performance)
  60. Results • Human pose estimation on the MPII dataset, mean of 3 trials (plot compares entropy [Joshi, CVPR'09], core-set [Sener et al., ICLR'18], and ours)
  61. Results • Human pose estimation on the MPII dataset, mean of 3 trials (the plot highlights a +1.84% gain for our method)
  62. Results • Data selection vs. number of stacks: data selection by active learning gives +1.84%, while the gap between an 8-stacked and a 2-stacked hourglass is +0.25%
  63. Results • Entropy vs. predicted loss on the MPII dataset (two plots, each with an "MSE loss" axis)
  64. Experiments (2) • To validate "active domain adaptation"
      • Source domain: MNIST (#train: 60k, #test: 10k); the 60k are used as the initial labeled pool
      • Target domain: MNIST + background (#train: 12k, #test: 50k); 1k added for each cycle
  65. Results • Image classification on MNIST (figure: the PyTorch MNIST model* with the loss prediction module; features of 10×12×12, 20×4×4, and 50 each pass through a GAP → FC → ReLU branch to 64-d, and the concatenated 192-d vector feeds the final FC) *https://github.com/pytorch/examples/tree/master/mnist
  66. Results • Domain adaptation from MNIST to MNIST+background (plot: loss-prediction performance)
  67. Results • Domain adaptation from MNIST to MNIST+background • Target-domain performance (plot compares entropy [Joshi, CVPR'09], core-set [Sener et al., ICLR'18], and ours; core-set suffers because its feature space is overfitted to the source domain)
  68. Results • Domain adaptation from MNIST to MNIST+background (the plot highlights a +1.20% gain for our method)
  69. Results • Data selection vs. architecture: data selection by active learning gives +1.20%, while the gap between WideResNet-14 and the 4-layer PyTorch MNIST model is +2.85%
  70. Conclusion • Introduced a novel active learning method that • works well with current deep networks • is task-agnostic • Verified on three major visual recognition tasks with three popular network architectures
  71. Conclusion • "Pick more important data, and get better performance!"
