SIGNATE
国立国会図書館の画像データレイアウト認識 (National Diet Library image data layout recognition)
1st place solution
coz.a
@coz_a_
Task
• Object Detection
• Metric: mean IoU
• Dataset
  • # of Images (train / test)
      古典籍 (classical books):                 1219 / 211
      明治期以降刊行 (Meiji era or later):      1175 / 252
      Total:                                    2394 / 463
  • # of Boxes (train)
                       1_overall  2_handwritten  3_typography  4_illustration  5_stamp  6_headline  7_caption  8_textline
      古典籍                1219          13851          9262            1119      369           -          -           -
      明治期以降刊行        1175              -             -            1207       78        3150       1462       60447
      Total                 2394          13851          9262            2326      447        3150       1462       60447
  • 1_overall boxes correspond one-to-one with images (one overall box per page).
CNN Architecture
• CenterNet-style detector (Objects as Points) with an additional margin-regression head for the 1_overall box.
• Backbone: EfficientNet (ImageNet pretrained) → BiFPN
• Input: image [b, 3, h, w]
• Output heads:
  • keypoint heatmap (categories 2~8): [b, 7, h/4, w/4]
  • box size (*):                      [b, 2, h/4, w/4]
  • local offset:                      [b, 2, h/4, w/4]
  • margin (*) (1_overall):            [b, 4]
• The keypoint heatmap is multiplied by a category mask [b, 7, 1, 1] that zeroes out categories absent from the document type:
  • 古典籍:         [1, 1, 1, 1, 0, 0, 0]
  • 明治期以降刊行: [0, 0, 1, 1, 1, 1, 1]
• (*) normalized by input image width

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks https://arxiv.org/abs/1905.11946
EfficientDet: Scalable and Efficient Object Detection https://arxiv.org/abs/1911.09070
Objects as Points https://arxiv.org/abs/1904.07850
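A minimal PyTorch sketch of the head layout and category-mask multiplication described above. The EfficientNet + BiFPN backbone/neck is abstracted into a `features` module, and the head internals (3x3 conv heads, global pooling + linear for the margin head) are assumptions; the slides only specify the output shapes.

import torch
import torch.nn as nn

N_CAT = 7  # categories 2_handwritten .. 8_textline

def head(in_ch, out_ch):
    # small conv head on the stride-4 feature map (assumed structure)
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1),
    )

class LayoutCenterNet(nn.Module):
    def __init__(self, features, feat_ch=64):
        super().__init__()
        self.features = features              # image [b,3,h,w] -> [b,feat_ch,h/4,w/4]
        self.heatmap = head(feat_ch, N_CAT)   # keypoint heatmap      [b,7,h/4,w/4]
        self.size = head(feat_ch, 2)          # box size              [b,2,h/4,w/4]
        self.offset = head(feat_ch, 2)        # local offset          [b,2,h/4,w/4]
        self.margin = nn.Sequential(          # margin for 1_overall  [b,4]
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, 4),
        )

    def forward(self, image, category_mask):
        # category_mask [b,7,1,1]: 1 for categories that can occur in the
        # document type, 0 otherwise; it zeroes impossible classes in the heatmap.
        f = self.features(image)
        return {
            "heatmap": torch.sigmoid(self.heatmap(f)) * category_mask,
            "size": self.size(f),
            "offset": self.offset(f),
            "margin": self.margin(f),
        }

# Per-document-type masks from the slide (categories 2~8):
MASKS = {
    "古典籍":         torch.tensor([1., 1., 1., 1., 0., 0., 0.]).view(1, N_CAT, 1, 1),
    "明治期以降刊行": torch.tensor([0., 0., 1., 1., 1., 1., 1.]).view(1, N_CAT, 1, 1),
}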
Training Parameters
• 5-Fold CV
• Batch Size: 6 (2 GPUs: GTX 1080 Ti × 2)
• Epochs: 104
• Optimizer: RAdam, LR=1.2e-3 (×0.1 at epoch=[64, 96])
• Data Augmentation:
  • Random Crop & Scale
  • Grayscale / Thresholding (cv2.adaptiveThreshold; see the first sketch below)
  • Random Rotate (±0.2 degrees)
  • Cutout (at the side edges)
• Loss Function (see the second sketch below):
  • keypoint heatmap: Focal Loss (weight=1.0)
  • box size: L1 Loss (weight=5.0)
  • local offset: L1 Loss (weight=0.2)
  • margin: L1 Loss (weight=12.5)
On the Variance of the Adaptive Learning Rate and Beyond https://arxiv.org/abs/1908.03265
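A possible implementation of the grayscale / adaptive-threshold augmentation listed above. The probabilities, blockSize and C values are illustrative assumptions, not the author's settings.

import cv2
import numpy as np

def gray_or_threshold(img: np.ndarray, p_gray=0.5, p_thresh=0.25) -> np.ndarray:
    """img: HxWx3 uint8 BGR page image; returns an HxWx3 uint8 image."""
    r = np.random.rand()
    if r < p_thresh:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # local Gaussian-weighted mean threshold; blockSize=31, C=10 are assumptions
        binarized = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 10)
        return cv2.cvtColor(binarized, cv2.COLOR_GRAY2BGR)
    if r < p_thresh + p_gray:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
    return img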
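A minimal sketch of the weighted multi-task loss above. The focal loss is written as the penalty-reduced variant from CenterNet (Objects as Points), and the L1 terms are evaluated only at annotated keypoint locations via a mask; the target layout (`tgt`) is an assumption.

import torch
import torch.nn.functional as F

W = {"heatmap": 1.0, "size": 5.0, "offset": 0.2, "margin": 12.5}  # weights from the slide

def centernet_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: [b,7,h/4,w/4]; gt is a Gaussian-splatted heatmap in [0,1]."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    n_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / n_pos

def total_loss(out, tgt):
    """out: model outputs; tgt: targets, with tgt['mask'] [b,1,h/4,w/4] = 1 at object centers."""
    m = tgt["mask"]
    n = m.sum().clamp(min=1.0)
    l_hm  = centernet_focal_loss(out["heatmap"], tgt["heatmap"])
    l_sz  = (F.l1_loss(out["size"],   tgt["size"],   reduction="none") * m).sum() / n
    l_off = (F.l1_loss(out["offset"], tgt["offset"], reduction="none") * m).sum() / n
    l_mar = F.l1_loss(out["margin"], tgt["margin"])
    return (W["heatmap"] * l_hm + W["size"] * l_sz
            + W["offset"] * l_off + W["margin"] * l_mar)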
Prediction
• Pipeline: Input → Pad & Scale → Detect → Boxes (3 scales × 5 models × 5 folds) → Weighted Boxes Fusion → Output
• Pad & Scale: each image is padded and resized to 3 test-time scales per model (TTA); the scales used are
  640x480, 768x576, 896x672, 1024x768, 1152x864, 1280x960, 1408x1056
• Detect: 5 models × 5 folds; per model and scale, CNN -> build boxes -> score threshold=0.005 -> NMS (iou threshold=0.18) (see the decoding sketch below)
  • Φ=4, trained on 768x576
  • Φ=3, trained on 896x672
  • Φ=2, trained on 1024x768
  • Φ=1, trained on 1152x864
  • Φ=0, trained on 1280x960
  (Φ is the model scaling parameter; ref. EfficientDet: Scalable and Efficient Object Detection)
• Weighted Boxes Fusion: fuse the 3 scale × 5 model × 5 fold box sets with iou_threshold=0.38 (see the WBF sketch below)

Weighted Boxes Fusion: ensembling boxes for object detection models https://arxiv.org/abs/1910.13302
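A sketch of the per-model, per-scale decoding step noted above (build boxes -> score threshold=0.005 -> NMS at IoU 0.18), following the standard CenterNet decode. The top-k value, the class-wise NMS, and the de-normalization of box size by image width (as stated on the architecture slide) are assumptions about details the slides leave open; the 1_overall box comes from the margin head and is not decoded here.

import torch
import torch.nn.functional as F
from torchvision.ops import batched_nms

def decode(heatmap, size, offset, img_w, k=100, score_thr=0.005, nms_iou=0.18):
    """Single image. heatmap [7,H,W], size/offset [2,H,W] on the stride-4 grid."""
    C, H, W = heatmap.shape
    # keep only local maxima (3x3 max-pool trick from CenterNet)
    peaks = (F.max_pool2d(heatmap[None], 3, stride=1, padding=1)[0] == heatmap).float()
    scores, idx = (heatmap * peaks).view(-1).topk(k)
    cls = torch.div(idx, H * W, rounding_mode="floor")
    ys = torch.div(idx % (H * W), W, rounding_mode="floor")
    xs = idx % W
    # box centers in input pixels (stride 4); sizes de-normalized by image width
    cx = (xs.float() + offset[0, ys, xs]) * 4.0
    cy = (ys.float() + offset[1, ys, xs]) * 4.0
    w = size[0, ys, xs] * img_w
    h = size[1, ys, xs] * img_w
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    keep = scores > score_thr
    boxes, scores, cls = boxes[keep], scores[keep], cls[keep]
    keep = batched_nms(boxes, scores, cls, nms_iou)  # class-wise NMS
    return boxes[keep], scores[keep], cls[keep]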
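A sketch of the final ensembling step using the ensemble-boxes package that implements the referenced Weighted Boxes Fusion paper. The package expects coordinates normalized to [0, 1]; equal run weights and skip_box_thr=0.0 are assumptions.

import numpy as np
from ensemble_boxes import weighted_boxes_fusion

def ensemble(per_run_boxes, per_run_scores, per_run_labels, img_w, img_h):
    """Each argument is a list with one entry per (scale, model, fold) run:
    boxes as [N,4] arrays in pixels (x1, y1, x2, y2), scores [N], labels [N]."""
    norm = np.array([img_w, img_h, img_w, img_h], dtype=np.float32)
    boxes_list = [np.asarray(b, dtype=np.float32) / norm for b in per_run_boxes]
    boxes, scores, labels = weighted_boxes_fusion(
        boxes_list, per_run_scores, per_run_labels,
        iou_thr=0.38,      # from the slide
        skip_box_thr=0.0,  # assumption: keep all boxes
    )
    return boxes * norm, scores, labels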
Score History

                                                                          public    private
  single model (Φ=4), 5-Fold CV, NMS ensemble                             0.79143   0.82140
  single model (Φ=4), 5-Fold CV, TTA (3 scales), NMS ensemble             0.80315   0.82782
  3 models (Φ=[0, 2, 4]), 5-Fold CV, TTA (3 scales), NMS ensemble         0.80468   0.82961
  3 models (Φ=[0, 2, 4]), 5-Fold CV, TTA (3 scales), WBF ensemble         0.82226   0.84791
  5 models (Φ=[0, 1, 2, 3, 4]), 5-Fold CV, TTA (3 scales), WBF ensemble   0.82340   0.84978
