The document presents ConvNeXt, a pure convolutional neural network architecture that achieves performance on par with state-of-the-art vision transformers. The authors modernize the ResNet architecture with design choices borrowed from vision transformers, such as longer training schedules, stronger data augmentation, and larger kernel sizes. The resulting ConvNeXt models match or surpass Swin Transformers on tasks such as image classification, object detection, and segmentation while retaining the simplicity and efficiency of convolutional networks. The study challenges the belief that vision transformers are inherently more accurate and efficient than convolutional networks.
#PR12 #PR366
Hello, this is the 366th paper review from the paper-reading group PR-12.
This year marks the 10th anniversary of AlexNet.
After AlexNet burst onto the scene in 2012, the 2010s were an era when "solve a computer vision problem = use a CNN" was treated as a formula.
In the 2020s, starting with the arrival of ViT, Transformer-based networks have threatened CNNs' position and have already taken over much of it.
Where should CNNs go in the 2020s?
Is it really true that a Transformer, with its weaker inductive biases, always beats a CNN when trained on large-scale data?
This paper, titled "A ConvNet for the 2020s", proposes a (somewhat) new architecture called ConvNeXt.
In fact, there is nothing fundamentally new: it takes existing techniques and ideas adopted by Transformers and applies them to a CNN,
and the result reportedly outperforms Transformers in both accuracy and speed.
There has been some controversy about these results on Twitter; you can find the details, including this discussion, in the video.
Thank you to everyone who watches, likes, comments, and subscribes :)
Paper: https://arxiv.org/abs/2201.03545
Video: https://youtu.be/Mw7IhO2uBGc
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks — Jinwon Lee
This is the 169th paper review from the TensorFlow-KR paper-reading group PR-12.
This time we look at EfficientNet, published by Google. Research on efficient neural networks has usually focused on small networks for edge devices with limited computing power, such as mobile phones. In contrast, networks are commonly scaled up to improve accuracy, and this paper studies how to scale a network up in the most efficient way. Please see the video for details; a small sketch of the compound-scaling rule follows the links below.
Paper: https://arxiv.org/abs/1905.11946
Video: https://youtu.be/Vhz0quyvR7I
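As a rough illustration of EfficientNet's compound scaling, here is a minimal Python sketch. The coefficients are the ones reported in the paper (alpha = 1.2 for depth, beta = 1.1 for width, gamma = 1.15 for resolution, chosen so that alpha · beta² · gamma² ≈ 2); the helper function itself is hypothetical, and the released models round these values differently.

```python
# Sketch of EfficientNet-style compound scaling.
# Coefficients from the paper: alpha=1.2 (depth), beta=1.1 (width),
# gamma=1.15 (resolution), chosen so alpha * beta**2 * gamma**2 ~= 2.
def compound_scale(phi: int, alpha: float = 1.2, beta: float = 1.1, gamma: float = 1.15):
    """Return (depth, width, resolution) multipliers for scaling factor phi."""
    depth_mult = alpha ** phi        # more layers
    width_mult = beta ** phi         # more channels
    resolution_mult = gamma ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

# Example: scale the B0 baseline (224x224 input) by phi = 3
d, w, r = compound_scale(3)
print(f"depth x{d:.2f}, width x{w:.2f}, input ~{224 * r:.0f}px")
```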
Convolutional Neural Networks (ConvNets) have been at the forefront of deep learning and computer vision tasks in the 2020s. These networks have undergone several advancements and improvements in recent years. Here's an overview of some key components and trends in ConvNets for the 2020s:
Architectural advancements: ConvNets have seen the development of more sophisticated architectures. One notable example is the introduction of residual connections, as seen in the ResNet architecture. Residual connections alleviate the vanishing gradient problem and enable the training of very deep networks.
Attention mechanisms: Inspired by the success of attention mechanisms in natural language processing tasks, ConvNets have incorporated attention mechanisms into their architectures. These mechanisms allow networks to focus on specific regions or features, enhancing their discriminative power. Popular attention mechanisms include spatial attention and channel attention.
Efficient architectures: With the increasing demand for deploying ConvNets on resource-constrained devices, there has been a focus on designing efficient architectures. Models like MobileNet and EfficientNet have been developed, which use depth-wise convolutions, squeeze-and-excitation modules, and neural architecture search techniques to reduce the computational cost while maintaining competitive performance.
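As an illustration of one such building block, here is a minimal, self-contained sketch of the squeeze-and-excitation idea in PyTorch; the channel count and reduction ratio (4) are illustrative choices, not values from any particular model.

```python
import torch
import torch.nn as nn

# Minimal squeeze-and-excitation (SE) block sketch: globally pool each channel,
# pass the channel statistics through a small MLP, and rescale the feature maps.
class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                       # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                  # squeeze: global average pool -> (N, C)
        s = torch.relu(self.fc1(s))             # excitation MLP
        s = torch.sigmoid(self.fc2(s))          # per-channel gates in (0, 1)
        return x * s[:, :, None, None]          # rescale feature maps channel-wise

y = SqueezeExcite(64)(torch.randn(2, 64, 32, 32))  # same shape, channel-reweighted
```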
Self-supervised learning: ConvNets have benefited from advancements in self-supervised learning techniques. By leveraging unlabeled data, models can learn useful representations through pretext tasks. These pre-trained models can then be fine-tuned on specific downstream tasks, leading to improved performance even with limited labeled data.
Transfer learning: Transfer learning has become a standard practice in ConvNets. Models pre-trained on large-scale datasets, such as ImageNet, serve as a starting point for various computer vision tasks. By leveraging the learned representations, transfer learning enables faster convergence and better generalization on new tasks, even with smaller labeled datasets.
Generative models: ConvNets have been used to develop powerful generative models, such as generative adversarial networks (GANs) and variational autoencoders (VAEs). GANs have been employed for tasks like image synthesis, super-resolution, and image-to-image translation. VAEs have been utilized for tasks like image generation, anomaly detection, and data augmentation.
Interpretability and explainability: As ConvNets have become more complex, the need for interpretability and explainability has also grown. Techniques like Grad-CAM (Gradient-weighted Class Activation Mapping) and attention maps help visualize which regions of an input image are important for the network's predictions. These approaches aid in understanding the decision-making process of ConvNets.
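To make this concrete, a rough Grad-CAM sketch in PyTorch follows: the last conv stage's activations are weighted by the gradient of the target class score and summed into a coarse heatmap. The choice of `layer4` as the target layer and the random tensor standing in for a real image are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hedged Grad-CAM sketch on a randomly initialized ResNet-18.
model = models.resnet18(weights=None).eval()
acts, grads = {}, {}
layer = model.layer4  # last conv stage; the target layer is the user's choice
layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)   # stand-in for a real preprocessed image
score = model(x)[0].max()         # score of the predicted class
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled grads
cam = F.relu((weights * acts["v"]).sum(dim=1))       # (1, 7, 7) coarse heatmap
cam = F.interpolate(cam[None], size=(224, 224), mode="bilinear")[0]
```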
Continual learning: ConvNets have faced challenges in adapting to new tasks without forgetting previously learned knowledge, a problem known as catastrophic forgetting; regularization-based and rehearsal-based methods aim to mitigate it.
This talk explores the types of structure learned by Convolutional Neural Networks, the applications where they are most valuable, and a number of appropriate mental models for understanding deep learning.
PR-297: Training data-efficient image transformers & distillation through att... — Jinwon Lee
Hello, this is the 297th review from the TensorFlow Korea paper-reading group PR-12.
Only three papers remain until the end of PR-12 Season 3.
Once Season 3 ends, recruitment of new members for Season 4 will begin. Your interest and applications are welcome~~
(The recruitment notice will be posted in the Facebook TensorFlow Korea group.)
Today I reviewed Facebook's "Training data-efficient image transformers & distillation through attention".
Since Google's ViT paper, interest in computer vision algorithms that use only attention, without any convolution, has never been higher.
The DeiT model proposed in this paper uses the same architecture as ViT, but whereas ViT performed poorly when trained on ImageNet data alone,
DeiT uses improved training methods and a new knowledge-distillation approach to outperform EfficientNet using ImageNet data alone.
Is the CNN really fading away? Will attention conquer computer vision as well?
Personally, I am convinced that attention-based CV papers will pour out for a while, and that surprising things can happen there.
CNNs have advanced through a decade of intensive research, whereas Transformers have only recently been applied to CV, so expectations are even higher;
and because attention is the model family with the least inductive bias, I think it can produce even more surprising results.
OpenAI's recently released DALL-E is a representative example. If you are curious about yet another transformation by the Transformer, please see the video below.
Video: https://youtu.be/DjEvzeiWBTo
Paper: https://arxiv.org/abs/2012.12877
PR-207: YOLOv3: An Incremental Improvement — Jinwon Lee
This is the 207th paper review from the TensorFlow Korea paper-reading group PR-12.
This time, the paper is YOLO v3.
It is so well known that it hardly needs introduction: among object-detection algorithms, YOLO is a very distinctive one-stage algorithm. The paper walks through the changes applied after YOLO v2 (YOLO9000) to improve performance, and it also criticizes MS COCO's average mAP metric and discusses how mAP should be evaluated. Please see the video for details~
Paper: https://arxiv.org/abs/1804.02767
Video: https://youtu.be/HMgcvgRrDcA
PR-144: SqueezeNext: Hardware-Aware Neural Network Design — Jinwon Lee
This is the 144th paper review from the TensorFlow-KR paper-reading group PR-12.
This time I reviewed SqueezeNext, one of the representative efficient CNNs. I also reviewed its predecessor, SqueezeNet, as well as NetScore, a paper on a metric for evaluating CNNs, in which SqueezeNext took first place.
Papers:
SqueezeNext - https://arxiv.org/abs/1803.10615
SqueezeNet - https://arxiv.org/abs/1602.07360
NetScore - https://arxiv.org/abs/1806.05512
Video: https://youtu.be/WReWeADJ3Pw
PR-217: EfficientDet: Scalable and Efficient Object Detection — Jinwon Lee
This is the 217th paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is EfficientDet, from Google Brain. As a follow-up to EfficientNet, it proposes an object-detection method that aims for both accuracy and efficiency. To that end, it introduces a weighted bidirectional feature pyramid network (BiFPN) and a compound scaling method for detection similar to EfficientNet's. Please see the video for details.
Paper: https://arxiv.org/abs/1911.09070
Video: https://youtu.be/11jDC8uZL0E
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable ... — Joonhyung Lee
A presentation on CutMix, a beautifully simple, general data augmentation/regularization method for improving performance and reducing overfitting in deep learning models. Because it is applied to the input data, it can be used with any model for almost any task. It has been shown to be effective for a wide variety of tasks and may be useful for many more.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/08/how-transformers-are-changing-the-direction-of-deep-learning-architectures-a-presentation-from-synopsys/
Tom Michiels, System Architect for DesignWare ARC Processors at Synopsys, presents the “How Transformers are Changing the Direction of Deep Learning Architectures” tutorial at the May 2022 Embedded Vision Summit.
The neural network architectures used in embedded real-time applications are evolving quickly. Transformers are a leading deep learning approach for natural language processing and other applications involving time-dependent series data. Now, transformer-based deep learning network architectures are also being applied to vision applications, with state-of-the-art results compared to CNN-based solutions.
In this presentation, Michiels introduces transformers and contrasts them with the CNNs commonly used for vision tasks today. He examines the key features of transformer model architectures and shows performance comparisons between transformers and CNNs. He concludes the presentation with insights on why Synopsys thinks transformers are an important approach for future visual perception tasks.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Modern Convolutional Neural Network techniques for image segmentation — Gioele Ciaparrone
Recently, Convolutional Neural Networks have been successfully applied to image segmentation tasks. Here we present some of the most recent techniques that have increased accuracy on such tasks. First we describe the Inception architecture and its evolution, which made it possible to increase the width and depth of the network without increasing the computational burden. We then show how to adapt classification networks into fully convolutional networks, able to perform pixel-wise classification for segmentation tasks. We finally introduce the hypercolumn technique to further improve the state-of-the-art on various fine-grained localization tasks.
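To make the classification-to-fully-convolutional adaptation concrete, here is a minimal hedged sketch in PyTorch: the final fully-connected layer of a (randomly initialized) ResNet-18 is re-expressed as an equivalent 1x1 convolution, so the network emits a coarse per-location class-score map instead of a single vector.

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch: convert a classifier into a fully convolutional network (FCN-style).
backbone = models.resnet18(weights=None)
features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

# Re-express the Linear(512 -> 1000) head as an equivalent 1x1 convolution.
head = nn.Conv2d(512, 1000, kernel_size=1)
with torch.no_grad():
    head.weight.copy_(backbone.fc.weight.view(1000, 512, 1, 1))
    head.bias.copy_(backbone.fc.bias)

fcn = nn.Sequential(features, head)
scores = fcn(torch.randn(1, 3, 224, 224))  # (1, 1000, 7, 7) coarse score map
```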
This is the 243rd paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is Designing Network Design Spaces from Facebook AI Research, better known as RegNet.
When designing a CNN, are bottleneck layers really a good idea? Do more layers always yield higher accuracy? When the width and height of the activation map are halved (stride 2 or pooling), the channel count is doubled: is that optimal? Might it be better to have no bottleneck layers at all, is there a magic number of layers that gives the best performance, and when the activations are halved, might tripling the channels work better than doubling them?
Rather than designing a single neural network, this paper is about designing a good design space: a space inhabited by good neural networks, in which techniques such as AutoML can find them. It proposes narrowing an almost unconstrained design space down to a good one through a human-in-the-loop process. In the video below you can see which design space gave birth to RegNet, which outperforms EfficientNet, and whether any of the design choices we took for granted turn out to be wrong~
Video: https://youtu.be/bnbKQRae_u4
Paper: https://arxiv.org/abs/2003.13678
Convolutional Neural Networks: Popular Architectures — ananth
In this presentation we look at some of the popular architectures, such as ResNet, that have been successfully used for a variety of applications. Starting from AlexNet and VGG, which showed that deep learning architectures can deliver unprecedented accuracy for image classification and localization tasks, we review other recent architectures such as ResNet, GoogLeNet (Inception) and the more recent SENet that have won ImageNet competitions.
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable C... — taeseon ryu
Compared with the recent progress of large vision transformers (ViT), large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work proposes InternImage, a new large-scale CNN-based model that can obtain the same kind of gains from increased parameters and training data as ViTs. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as its core operator, so that the model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also performs adaptive spatial aggregation conditioned on the input and task information. As a result, InternImage reduces the strict inductive biases of traditional CNNs and makes it possible to learn stronger and more robust patterns from massive parameters and data, as ViTs do. The model's effectiveness is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. InternImage-H achieved 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, setting new records that surpass the current best CNNs and ViTs.
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme... — Jinwon Lee
#PR12 #PR344
Hello, this is the 344th paper review from the TensorFlow Korea paper-reading group PR-12.
Today's paper, from USTC and MSRA, carries the striking title "A Battle of Network Structures".
As the subtitle suggests, it compares CNNs, Transformers, and MLPs for computer vision under identical conditions to identify their characteristics.
To experiment under the same conditions, the authors build a unified framework called SPACH and plug CNN, Transformer, and MLP modules into it.
All three achieve similar performance when conditions are right, but the MLP overfits as model size grows;
the CNN generalizes better, performing well even with less data than the Transformer;
and the Transformer, with its larger model capacity, excels when data is plentiful and compute budgets are large. That is one finding of the experiments.
Another is that even Transformers and MLPs, which have global receptive fields, improve when combined with local models that perform local operations.
Building on these insights, the paper proposes a hybrid model combining CNN and Transformer components and shows that it can reach SOTA performance.
Personally, I did not find any startling insight, but I would rate it as a paper that nicely organizes the characteristics, strengths, and weaknesses of the three network families.
Please see the video for details! Thank you.
Video: https://youtu.be/NVLMZZglx14
Paper: https://arxiv.org/abs/2108.13002
PR-183: MixNet: Mixed Depthwise Convolutional Kernels — Jinwon Lee
This is the 183rd paper review from the TensorFlow-KR paper-reading group PR-12.
This time we look at MixNet, published by Google Brain. Depthwise convolutions are widely used in efficiency-oriented CNNs; this paper proposes varying the depthwise convolution filter sizes within a layer to improve both accuracy and efficiency. Please see the video for details; a rough sketch of the idea follows after the links.
Paper: https://arxiv.org/abs/1907.09595
Video: https://youtu.be/252YxqpHzsg
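A rough sketch of the mixed-kernel idea (MixConv) under the assumption of an even channel split across three illustrative kernel sizes; the class name and parameter choices are hypothetical, not the paper's reference code.

```python
import torch
import torch.nn as nn

# Sketch of a mixed depthwise convolution: channels are split into groups,
# each processed with a different depthwise kernel size, then re-concatenated.
class MixConv(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        splits = [channels // len(kernel_sizes)] * len(kernel_sizes)
        splits[0] += channels - sum(splits)  # absorb any remainder channels
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c)  # depthwise per split
            for c, k in zip(splits, kernel_sizes)
        )

    def forward(self, x):
        chunks = torch.split(x, self.splits, dim=1)
        return torch.cat([conv(c) for conv, c in zip(self.convs, chunks)], dim=1)

y = MixConv(96)(torch.randn(1, 96, 56, 56))  # same shape, mixed receptive fields
```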
High-Performance Inference of Deep Networks on GPUs with TensorRT / Ма... — Ontico
Inference performance is one of the most serious problems when deploying DL applications, since it determines the impression the end user gets from the service, as well as the cost of deploying the product. Inference therefore needs to be high-performance and energy-efficient. TensorRT automatically optimizes a trained neural network for maximum performance, providing significant speedups compared with commonly used frameworks.
From this presentation you will learn which optimizations TensorRT applies, how to use it, and how fast it is on selected tasks.
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Morph: Flexible Acceleration for 3D CNN-based Video Understanding.
Kartik Hegde, Rohit Agrawal, Yulun Yao, Christopher W. Fletcher
University of Illinois
2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
1. A ConvNet for the 2020s
Zhuang Liu et al.
Sushant Gautam
MSIISE
Department of Electronics and
Computer Engineering
Institute of Engineering,
Thapathali Campus
13 March 2022
2. Introduction: ConvNets
• LeNet/ AlexNet / ResNets
• Have several built-in inductive biases
• well-suited to a wide variety of computer vision applications.
• Translation equivariance
• Shared computations:
• inherently efficient in a sliding-window setting
• Fundamental building block in visual recognition systems
3. Transformers in NLP: BERT
• NN design for NLP:
• Transformers replaced RNNs
• became the dominant backbone architecture
5. Introduction: Vision Transformers
• Vision Transformers (ViT) in 2020:
• completely altered the landscape of network architecture design.
• two streams surprisingly converged
• Transformers outperform ResNets by a significant margin.
• on image classification
• only on image classification
11. ViTs were challenging
• ViT’s global attention design:
• quadratic complexity with respect to the input size.
• Hierarchical Transformers
• hybrid approach to bridge this gap.
• for example, the sliding window strategy (e.g., attention within local windows)
• behave more similarly to ConvNets.
12. Swin Transformer
• a milestone work in this direction
• demonstrated Transformers as a generic vision backbone
• achieved SOTA results across a range of computer vision tasks
• beyond image classification.
• successful and rapid adoption revealed: the essence of convolution is not becoming irrelevant; rather, it remains much desired and has never faded.
19. Swin Transformer
• Still relies on Patches
• But instead of choosing one
• Start with smaller patch and merge in the deeper transformer layers
• Similar to UNet and Convolution
21. Swin Transformer
• Shifted windows Transformer
• Self-attention in non-overlapped windows
• Hierarchical feature maps
22. Back to the topic: ConvNeXt
Objectives:
• bridge the gap between the pre-ViT and post-ViT ConvNets
• test the limits of what a pure ConvNet can achieve.
• Exploration: How do design decisions in Transformers impact ConvNets' performance?
• Result: a family of pure ConvNets: ConvNeXt.
23. Modernizing a ConvNet: a Roadmap
• Recipe: Two ResNet model sizes in terms of G-FLOPs
ResNet-50 / Swin-T: ~4.5 × 10^9 FLOPs
ResNet-200 / Swin-B: ~15.0 × 10^9 FLOPs
• Network Modernization
1. ViT training techniques for ResNet
2. ResNeXt
3. Inverted bottleneck
4. Large kernel size
5. Various layer-wise micro designs
24. ViT Training Techniques
• Training epochs: 300 epochs (from the original 90 epochs for ResNets)
• Optimizer: AdamW (Weight Decay)
• Data augmentation
Mixup
Cutmix
RandAugment
Random Erasing
• Regularization
Stochastic Depth
Label Smoothing
• This increased the performance of the ResNet-50 model from 76.1% to 78.8% (+2.7%).
• Note: label smoothing makes the model less overconfident.
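A minimal sketch of how the optimizer and label-smoothing pieces of this recipe look in plain PyTorch; the stand-in model and hyperparameter values are illustrative, and the augmentation/regularization items above (Mixup, CutMix, RandAugment, Random Erasing, stochastic depth) would typically come from a library such as timm.

```python
import torch
import torch.nn as nn

# Stand-in model; any vision backbone would go here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))

# AdamW = Adam with decoupled weight decay; label smoothing via CrossEntropyLoss.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# One illustrative training step on random data.
images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, 1000, (8,))
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```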
25. Changing Stage Compute Ratio
• Swin-T's compute ratio across stages is 1:1:3:1; for the larger Swin-B, the ratio is 1:1:9:1.
• #Conv blocks in ResNet-50: (3, 4, 6, 3)
• ConvNeXt adopts (3, 3, 9, 3) for ResNet-50,
• which also aligns the FLOPs with Swin-T.
• This improves the model accuracy from 78.8% to 79.4%.
26. Changing Stem to “Patchify”
• Stem cell of a standard ResNet: a 7x7 convolution layer
• followed by a max pool, which results in a 4x downsampling.
• Stem cell of ViT: a more aggressive "patchify" strategy,
• corresponding to a large kernel size (14 or 16) with non-overlapping convolution.
• Swin Transformer uses a similar "patchify" layer,
• but with a smaller patch size of 4 to accommodate the architecture's multi-stage design.
• ConvNeXt replaces the ResNet-style stem cell with a patchify layer implemented as a 4x4, stride-4 convolution layer.
• The accuracy changes from 79.4% to 79.5%.
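A minimal PyTorch sketch of the two stems, both giving 4x spatial downsampling; channel counts follow the text (64 for ResNet, 96 for ConvNeXt).

```python
import torch
import torch.nn as nn

# ResNet-style stem: 7x7 stride-2 conv, then a stride-2 max pool (4x total).
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# ConvNeXt "patchify" stem: 4x4 non-overlapping (stride-4) conv.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```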
27. ResNeXt-ify
• ResNeXt has a better FLOPs/accuracy trade-off than ResNet.
• The core component is grouped convolution.
• Guiding principle: “use more groups, expand width”.
• ConvNeXt uses depthwise convolution, a special case of grouped convolution where the number of groups equals the number of channels.
• ConvNeXt increases the network width to the same number of channels as Swin-T's (from 64 to 96).
• This brings the network performance to 80.5%, with increased FLOPs (5.3G).
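In code, a depthwise convolution is just a grouped convolution with groups equal to the channel count; a pointwise 1x1 conv then mixes information across channels. A sketch, using the 7x7 kernel size adopted later in the roadmap:

```python
import torch.nn as nn

dim = 96  # Swin-T width, as adopted by ConvNeXt

# Depthwise: groups == channels, so each channel is filtered independently.
depthwise = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)

# Pointwise 1x1 conv mixes information across channels.
pointwise = nn.Conv2d(dim, dim, kernel_size=1)
```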
29. Inverted Bottleneck
• An important design in every Transformer block:
• the hidden dimension of the MLP block is 4 times wider than the input.
• This is connected to the inverted bottleneck design with an expansion ratio of 4 in ConvNets,
• popularized by MobileNetV2 and since adopted in several advanced ConvNets.
• Cost: increased FLOPs for the depthwise convolution layer,
• but a significant FLOPs reduction in the downsampling residual blocks.
• This reduces the whole-network FLOPs to 4.6G.
• Interestingly, performance improves (80.5% to 80.6%).
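A sketch of the inverted bottleneck as described here: a 1x1 expansion to 4x the width, a nonlinearity, and a 1x1 projection back to the original width (the placement of the depthwise conv is addressed on the next slide).

```python
import torch.nn as nn

dim = 96
# Inverted bottleneck (MobileNetV2-style): narrow -> wide -> narrow,
# mirroring the Transformer MLP block's 4x hidden dimension.
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),  # expand 4x
    nn.GELU(),
    nn.Conv2d(4 * dim, dim, kernel_size=1),  # project back
)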
30. Moving Up Depthwise Conv Layer
• Move the depthwise conv layer up, as in Transformers,
• where the MSA block is placed before the MLP layers.
• As ConvNeXt has an inverted bottleneck block, this is a natural design choice:
• the complex/inefficient modules (MSA, large-kernel conv) have fewer channels, while the efficient, dense 1x1 layers do the heavy lifting.
• This intermediate step reduces the FLOPs to 4.1G, resulting in a temporary performance degradation to 79.9%.
• Note: a self-attention operation can be viewed as a dynamic lightweight convolution, akin to a depthwise separable convolution.
31. Increasing the Kernel Size
• Swin Transformers reintroduced the local window to the self-attention block; the window size is at least 7x7, significantly larger than the ResNe(X)t kernel size of 3x3.
• The authors experimented with several kernel sizes: 3, 5, 7, 9, and 11. The network's performance increases from 79.9% (3x3) to 80.6% (7x7), while the network's FLOPs stay roughly the same.
• A ResNet-200-regime model does not exhibit further gains when increasing the kernel size beyond 7x7.
32. Replacing ReLU with GELU
• One discrepancy between NLP and vision architectures is which activation function to use.
• The Gaussian Error Linear Unit (GELU):
• a smoother variant of ReLU,
• utilized in the most advanced Transformers:
• Google's BERT, OpenAI's GPT-2, and ViTs.
• ReLU can be substituted with GELU in ConvNeXt too,
• but the accuracy stays unchanged (80.6%).
33. Fewer Activation Functions
• Transformers have fewer activation functions than ResNet:
• only one activation function is present in the MLP block.
• In comparison, it is common practice to append an activation function to each convolutional layer, including the 1x1 convs.
• Replicating the Transformer block:
• eliminate all GELU layers from the residual block except for one between the two 1x1 layers.
• This improves the result by 0.7% to 81.3%, practically matching the performance of Swin-T.
34. Fewer Normalization Layers
• Transformer blocks usually have fewer normalization layers as well.
• ConvNeXt removes two BatchNorm (BN) layers, leaving only one BN layer before the 1x1 conv layers.
• This further boosts performance to 81.4%, already surpassing Swin-T's result.
35. Substituting BN with LN
• BatchNorm is an essential component in ConvNets, as it improves convergence and reduces overfitting.
• The simpler Layer Normalization (LN) has been used in Transformers,
• resulting in good performance across different application scenarios.
• Directly substituting LN for BN in the original ResNet results in suboptimal performance.
• With all the modifications to the network architecture and training techniques, the authors revisit the impact of using LN in place of BN.
• The ConvNeXt model has no difficulty training with LN; in fact, performance is slightly better, reaching an accuracy of 81.5%.
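Collecting the choices above, a ConvNeXt-style residual block can be sketched as follows. This is a simplified reading of the paper's description (it omits details such as layer scale and stochastic depth), not the official implementation.

```python
import torch
import torch.nn as nn

# ConvNeXt-style block: 7x7 depthwise conv first, a single LayerNorm, a 4x
# inverted bottleneck of pointwise layers with one GELU between them, and a
# residual connection. LN/Linear operate in channels-last layout.
class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int = 96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # applied over the channel dim
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv as Linear (channels-last)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, H, W, C) for LN/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)  # back to (N, C, H, W)

y = ConvNeXtBlock()(torch.randn(1, 96, 56, 56))  # same shape as the input
```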
36. Separate Downsampling Layers
• In ResNet, spatial downsampling is achieved by the residual block at the start of each stage, using a 3x3 conv with stride 2 (and a 1x1 conv with stride 2 at the shortcut connection).
• In Swin Transformers, a separate downsampling layer is added between stages.
• ConvNeXt uses 2x2 conv layers with stride 2 for spatial downsampling; adding normalization layers wherever the spatial resolution changes helps stabilize training.
• This improves the accuracy to 82.0%, significantly exceeding Swin-T's 81.3%.
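A sketch of such a separate downsampling layer: normalization at the resolution change, then a 2x2, stride-2 convolution that halves the spatial size and doubles the channels (following the usual pattern of stage widths).

```python
import torch
import torch.nn as nn

# Separate downsampling layer between stages, as described above.
class Downsample(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # normalize where the resolution changes
        self.reduce = nn.Conv2d(dim, 2 * dim, kernel_size=2, stride=2)

    def forward(self, x):              # x: (N, C, H, W)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.reduce(x)          # half the resolution, double the channels

print(Downsample(96)(torch.randn(1, 96, 56, 56)).shape)  # (1, 192, 28, 28)
```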
39. Closing Remarks
• ConvNeXt, a pure ConvNet, can outperform the Swin Transformer for ImageNet-1K classification in this compute regime.
• It is also worth noting that none of the design options discussed thus far is novel; they have all been researched separately, but not collectively, over the last decade.
• The ConvNeXt model has approximately the same FLOPs, parameter count, throughput, and memory use as the Swin Transformer, but does not require specialized modules such as shifted-window attention or relative position biases.
41. Results – ImageNet-22K
• A widely held view is that vision Transformers have fewer inductive biases and thus can perform better than ConvNets when pre-trained at larger scale.
• Properly designed ConvNets are not inferior to vision Transformers when pre-trained on a large dataset:
• ConvNeXts still perform on par with or better than similarly-sized Swin Transformers, with slightly higher throughput.
43. Isotropic ConvNeXt vs. ViT
• Examines whether the ConvNeXt block design generalizes to ViT-style isotropic architectures, which have no downsampling layers and keep the same feature resolution (e.g., 14x14) at all depths.
• ConvNeXt performs generally on par with ViT, showing that the ConvNeXt block design is competitive when used in non-hierarchical models.
44. Results – Object Detection and Segmentation on COCO, Semantic Segmentation on ADE20K
46. Remarks on Model Efficiency[1]
• Under similar FLOPs, models with depthwise convolutions are known to be slower and to consume more memory than ConvNets with only dense convolutions.
• It is natural to ask whether the design of ConvNeXt renders it practically inefficient.
• As demonstrated throughout the paper, the inference throughput of ConvNeXt models is comparable to or exceeds that of Swin Transformers, both for classification and for other tasks requiring higher-resolution inputs.
47. Remarks on Model Efficiency[2]
• Furthermore, training ConvNeXts requires less memory than training Swin Transformers: training Cascade Mask R-CNN with a ConvNeXt-B backbone consumes 17.4GB of peak memory at a per-GPU batch size of 2, while the reference number for Swin-B is 18.5GB.
• It is worth noting that this improved efficiency is a result of the ConvNet inductive bias, and is not directly related to the self-attention mechanism in vision Transformers.
48. Conclusions[1]
• In the 2020s, vision Transformers, particularly hierarchical ones such as Swin Transformers, began to overtake ConvNets as the favored choice for generic vision backbones.
• The widely held belief is that vision Transformers are more accurate, efficient, and scalable than ConvNets.
• ConvNeXt, a pure ConvNet model, can compete favorably with state-of-the-art hierarchical vision Transformers across multiple computer vision benchmarks, while retaining the simplicity and efficiency of standard ConvNets.
49. Conclusions[2]
• In some ways, these observations are surprising, even though the ConvNeXt model itself is not completely new: many of its design choices have been examined separately over the last decade, but not collectively.
• The authors hope that the new results reported in this study will challenge several widely held views and prompt people to rethink the importance of convolution in computer vision.