Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
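The masked-prediction family above can be sketched in a few lines of Python. This is a minimal illustration, not the recipe of any particular paper: the patch values, the 50% mask ratio, and the zero-valued "decoder output" are all stand-in assumptions.

```python
import random

def mask_patches(num_patches, mask_ratio, seed=0):
    """Randomly split patch indices into visible and masked sets."""
    rng = random.Random(seed)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    n_masked = int(num_patches * mask_ratio)
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])  # visible, masked

def masked_mse(patches, predictions, masked):
    """Reconstruction loss computed only over the masked patches."""
    diffs = [(p - q) ** 2
             for i in masked
             for p, q in zip(patches[i], predictions[i])]
    return sum(diffs) / len(diffs)

# Four 2-value "patches" of a toy image; a real model would predict the
# masked patches from the visible ones -- here the decoder output is a stub.
patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
visible, masked = mask_patches(len(patches), mask_ratio=0.5)
predictions = [[0.0, 0.0] for _ in patches]
loss = masked_mse(patches, predictions, masked)
```

The essential point the sketch captures is that the loss is evaluated only on the masked positions, so the model cannot succeed by copying visible input.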
AU QP Answer key Nov/Dec 2015 Computer Graphics 5 sem CSE – Thiyagarajan G
This document contains a summary of a computer graphics exam with 10 multiple choice questions in Part A and 4 long answer questions in Part B. Some of the key topics covered include: image resolution, scaling matrices, color conversion between RGB and CMY color modes, Bezier curves, projection planes, dithering, animation principles, turtle attributes in graphics, Bresenham's circle algorithm, Liang-Barsky line clipping algorithm, viewing transformations, cubic Bezier curves, and backface detection. Part B also includes questions on orthographic vs axonometric vs oblique projections, ambient lighting models, raster vs keyframe animation, ray tracing, and morphing.
This document presents the design and implementation of an FPGA-based BCH decoder. It discusses BCH codes, which are binary error-correcting codes used in wireless communications. The implemented decoder is for a (15, 5, 3) BCH code, meaning it can correct up to 3 errors in a block of 15 bits. The decoder uses a serial input/output architecture and is implemented using VHDL on a FPGA device. It performs BCH decoding through syndrome calculation, running the Berlekamp-Massey algorithm to solve the key equation, and using Chien search to find error locations. The simulation result verifies correct decoding operation.
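The syndrome-calculation step of BCH decoding described above can be illustrated in Python over GF(2^4). The primitive polynomial x^4 + x + 1 and the (15, 5) generator polynomial used below are standard textbook conventions; this is an illustrative sketch, not the paper's VHDL implementation.

```python
# Antilog table for GF(2^4) built from the primitive polynomial x^4 + x + 1.
EXP = []
x = 1
for _ in range(15):
    EXP.append(x)
    x <<= 1
    if x & 0x10:
        x ^= 0b10011

def syndromes(received, t=3):
    """S_j = r(alpha^j) for j = 1..2t; all zero iff r(x) is a codeword."""
    out = []
    for j in range(1, 2 * t + 1):
        s = 0
        for k, bit in enumerate(received):
            if bit:
                s ^= EXP[(j * k) % 15]   # add alpha^(j*k) in GF(2^4)
        out.append(s)
    return out

# g(x) = x^10 + x^8 + x^5 + x^4 + x^2 + x + 1 (coefficients lowest degree
# first) is the (15, 5) BCH generator polynomial, hence itself a codeword.
codeword = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0]
clean = syndromes(codeword)          # all zeros for a valid codeword
corrupted = codeword[:]
corrupted[4] ^= 1                    # flip one bit
flagged = syndromes(corrupted)       # nonzero syndromes reveal the error
```

In a full decoder these syndromes feed the Berlekamp-Massey algorithm, and Chien search then locates the error positions.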
Vector-Based Back Propagation Algorithm of.pdf – Nesrine Wagaa
This document presents a vector-based backpropagation algorithm for a supervised convolution neural network (CNN) model. The key points are:
- The CNN model consists of one convolution layer followed by three fully connected hidden layers for classification of handwritten digits using the MNIST dataset.
- The classical convolution operation is replaced by a matrix operation to avoid mathematical complexities. Convolution maps and filters are represented as vectors.
- Forward propagation involves applying the new convolution and pooling operations to extract features, then passing the output through the fully connected layers.
- Backpropagation is used to update the CNN parameters (filters, weights, biases) via gradient descent to minimize a cost function, with update equations derived for both the convolution and the fully connected layers.
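The idea of replacing convolution with a matrix operation can be shown with an im2col-style sketch in numpy. The shapes and the single-filter case are illustrative simplifications of the paper's vector formulation.

```python
import numpy as np

def im2col(image, k):
    """Unroll every k x k patch of a 2-D image into one row of a matrix."""
    H, W = image.shape
    return np.array([image[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1)
                     for j in range(W - k + 1)])

def conv2d_as_matmul(image, kernel):
    """Valid cross-correlation written as a single matrix-vector product."""
    k = kernel.shape[0]
    out = im2col(image, k) @ kernel.ravel()     # the 'matrix operation'
    H, W = image.shape
    return out.reshape(H - k + 1, W - k + 1)

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, 1.0]])
# This kernel sums each pixel with its lower-right neighbour.
result = conv2d_as_matmul(image, kernel)
```

Because the patches become rows of a matrix, the backward pass also reduces to matrix products, which is the complexity reduction the summary refers to.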
On Optimization of Network-coded Scalable Multimedia Service Multicasting – Andrea Tassi
In the near future, the delivery of multimedia multicast services over next-generation networks is likely to become one of the main pillars of future cellular networks. In this extended abstract, we address the issue of efficiently multicasting layered video services by defining a novel optimization paradigm that is based on an Unequal Error Protection implementation of Random Linear Network Coding, and aims to ensure target service coverages by using a limited amount of radio resources.
Fast Object Recognition from 3D Depth Data with Extreme Learning Machine – Soma Boubou
Object recognition from RGB-D sensors has recently emerged as a renowned and challenging research topic. Current systems often require large amounts of time to train the models and to classify new data. We propose an effective and fast object recognition approach for 3D data acquired from depth sensors such as the Structure or Kinect sensors.
Our contribution in this work is to present a novel, fast, and effective approach for real-time object recognition from 3D depth data:
- First, we extract simple but effective frame-level features, which we name as differential frames, from the raw depth data.
- Second, we build a recognition system based on Extreme Learning Machine classifier with a Local Receptive Field (ELM-LRF).
Learning Convolutional Neural Networks for Graphs – pione30
This document summarizes a research paper on learning convolutional neural networks for graphs. It proposes a framework called PATCHY-SAN that applies CNNs to graphs by (1) selecting a node sequence and (2) generating normalized neighborhood representations for each node. Experimental results show PATCHY-SAN achieves accuracy competitive with graph kernels while being 2-8 times more efficient on benchmark graph classification tasks. The document concludes CNNs may be especially beneficial for learning graph representations when used with this proposed framework.
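The two PATCHY-SAN steps named above can be sketched in plain Python. The degree-based ordering and BFS neighbourhood used here are simplified stand-ins for the paper's graph-labeling and normalization procedures.

```python
from collections import deque

def receptive_fields(adj, w, k):
    """Sketch: pick w nodes by a canonical ordering, then take each node's
    k nearest neighbours (BFS order) as a fixed-size, ordered 'patch'."""
    # 1) node sequence selection: order by degree, break ties by node id
    order = sorted(adj, key=lambda v: (-len(adj[v]), v))[:w]
    fields = []
    for root in order:
        # 2) neighbourhood assembly by breadth-first search
        seen, queue, field = {root}, deque([root]), []
        while queue and len(field) < k:
            v = queue.popleft()
            field.append(v)
            for u in sorted(adj[v]):
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        field += [None] * (k - len(field))   # pad with dummy nodes
        fields.append(field)
    return fields

# Toy graph: a star centred on node 0 plus an edge 1-2.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
fields = receptive_fields(adj, w=2, k=3)
```

Each fixed-size field can then be fed to an ordinary 1-D convolution, which is how the framework applies CNNs to graphs.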
The document describes techniques for image texture analysis and segmentation. It proposes a methodology using constraint satisfaction neural networks to integrate region-based and edge-based texture segmentation. The methodology initializes a CSNN using fuzzy c-means clustering, then iteratively updates the neuron probabilities and edge maps to refine the segmentation. Experimental results demonstrate improved segmentation by combining region and edge information.
The document discusses transportation problems and assignment problems in operations research. It provides:
1) An overview of transportation problems, including the mathematical formulation to minimize transportation costs while meeting supply and demand constraints.
2) Methods for obtaining initial basic feasible solutions to transportation problems, such as the North-West Corner Rule and Vogel's Approximation Method.
3) Techniques for moving towards an optimal solution, including determining net evaluations and selecting entering variables.
4) The formulation and algorithm for solving assignment problems to minimize assignment costs while ensuring each job is assigned to exactly one machine.
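The North-West Corner Rule mentioned in point 2) is simple enough to sketch directly; the supply and demand figures below are made-up illustrative data for a balanced problem.

```python
def north_west_corner(supply, demand):
    """Initial basic feasible solution for a balanced transportation problem."""
    supply, demand = supply[:], demand[:]
    m, n = len(supply), len(demand)
    alloc = [[0] * n for _ in range(m)]
    i = j = 0
    while i < m and j < n:
        q = min(supply[i], demand[j])   # ship as much as the corner cell allows
        alloc[i][j] = q
        supply[i] -= q
        demand[j] -= q
        if supply[i] == 0:
            i += 1                      # row exhausted: move down
        else:
            j += 1                      # column satisfied: move right
    return alloc

supply = [20, 30, 25]
demand = [10, 25, 15, 25]
alloc = north_west_corner(supply, demand)
```

The resulting allocation meets every supply and demand constraint; it is only a starting point, which the net-evaluation step of point 3) then improves toward optimality.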
Injecting image priors into Learnable Compressive Subsampling – Martino Ferrari
My master thesis work extends the problem formulation of learnable compressive subsampling [1], which focuses on learning the best sampling operator in the Fourier domain adapted to the spectral properties of a training set of images. I formulated the problem as a reconstruction from a finite number of sparse samples with a prior learned from an external dataset or learned on the fly from the images to be reconstructed. In more detail, I developed two very different methods, one using multiband coding in the spectral domain and the second using a neural network.
The new methods can be applied to many different fields of spectroscopy and Fourier optics, for example in medical (computerized tomography, magnetic resonance spectroscopy) and astronomy (the Square Kilometre Array) imaging, where the capability to reconstruct high-quality images, in the pixel domain, from a limited number of samples, in the frequency domain, is a key issue.
The proposed methods have been tested on diverse datasets covering facial images, medical and multi-band astronomical data, using the mean square error and SSIM as a perceptual measure of the quality of the reconstruction.
Finally, I explored possible applications in data acquisition systems such as computed tomography and radio astronomy. The obtained results demonstrate that the proposed methods have very promising potential for future research and extensions.
For this reason, the work was both presented at the poster session of the EUSIPCO 2018 conference in Rome and submitted for an EU patent.
[1] L. Baldassarre, Y.-H. Li, J. Scarlett, B. Gözcü, I. Bogunovic, and V. Cevher, "Learning-based compressive subsampling," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 4, pp. 809–822, 2016.
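The index-set learning idea of [1] can be sketched in numpy: keep the Fourier coefficients with the highest average energy over a training set, then reconstruct by zero-filled inverse FFT. The smooth toy "training set" below is an illustrative assumption, chosen so that few coefficients carry all the energy.

```python
import numpy as np

def learn_mask(train_images, n_samples):
    """Keep the n_samples Fourier coefficients with highest mean energy
    over the training set (index-set optimisation)."""
    energy = np.zeros(train_images[0].shape)
    for img in train_images:
        energy += np.abs(np.fft.fft2(img)) ** 2
    flat = energy.ravel()
    keep = np.argsort(flat)[-n_samples:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[keep] = True
    return mask.reshape(energy.shape)

def reconstruct(image, mask):
    """Zero-filled inverse FFT from the sampled coefficients only."""
    spectrum = np.fft.fft2(image) * mask
    return np.real(np.fft.ifft2(spectrum))

# Smooth toy 'training set': energy concentrated in a few low frequencies.
xs = np.arange(16) * 2 * np.pi / 16
train = [np.outer(np.sin(xs + p), np.cos(xs)) for p in (0.0, 0.5, 1.0)]
mask = learn_mask(train, n_samples=16)      # keep 16 of 256 coefficients
err = np.mean((reconstruct(train[0], mask) - train[0]) ** 2)
```

Because the toy images share a small common spectral support, the learned mask covers it entirely and the reconstruction error is essentially zero; real images only approximate this situation, which is where the learned priors of the thesis come in.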
TWO DIMENSIONAL MODELING OF NONUNIFORMLY DOPED MESFET UNDER ILLUMINATION – VLSICS Design
A two-dimensional numerical model of an optically gated GaAs MESFET with non-uniform channel doping has been developed to characterize the device as a photodetector. First, the photo-induced voltage (Vop) at the Schottky gate is calculated to estimate the channel profile. Then Poisson's equation for the device is solved numerically under dark and illuminated conditions. The paper develops the 2-D MESFET model under illumination using a Monte Carlo finite-difference method. The results cover the optical potential developed in the device, the variation of channel potential under different biasing and illumination, and the electric fields along the X and Y directions. The gate-source capacitance (Cgs) under different illumination levels is also calculated. The results show that the device characteristics are strongly influenced by the incident optical illumination.
The proposed method uses an online weighted ensemble of one-class SVMs for feature selection in background/foreground separation. It automatically selects the best features for different image regions. Multiple base classifiers are generated using weighted random subspaces. The best base classifiers are selected and combined based on error rates. Feature importance is computed adaptively based on classifier responses. The background model is updated incrementally using a heuristic approach. Experimental results on the MSVS dataset show the proposed method achieves higher precision, recall, and F-score than the other compared methods.
This document discusses finite-difference calculus techniques used to approximate values of functions and derivatives at discrete points in reservoir simulation models. It introduces common finite-difference operators - including forward, backward, central, shift, and average operators - and examines their relationships to derivative operators in Taylor series expansions. Examples are provided to demonstrate calculating finite-difference approximations of first and second derivatives in 1D and 2D. The document also covers solving the Poisson equation and time-independent partial differential equations using finite-difference methods.
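The forward, backward, and central operators named above can be written down directly; the cubic test function and step size below are illustrative choices.

```python
def forward_diff(f, x, h):
    """First-order accurate, O(h): (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

def backward_diff(f, x, h):
    """First-order accurate, O(h): (f(x) - f(x-h)) / h."""
    return (f(x) - f(x - h)) / h

def central_diff(f, x, h):
    """Second-order accurate, O(h^2): (f(x+h) - f(x-h)) / 2h."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_diff(f, x, h):
    """Standard three-point approximation of the second derivative."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

f = lambda x: x ** 3   # exact: f'(2) = 12 and f''(2) = 12
```

Evaluating at x = 2 with h = 1e-4 shows the expected accuracy ordering: the central difference lands within about h^2 of the true derivative, while the one-sided differences err by about h.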
Digit recognizer by convolutional neural network – Ding Li
A convolutional neural network is used to recognize handwritten digits from images. The CNN uses convolutional and max pooling layers to extract local features from the images. These local features are then fed into fully connected layers to combine them into global features used to predict the digit (0-9) in each image with a softmax output layer. The model is trained on 60,000 images and achieves 99.67% accuracy on the test set after 30 training epochs. While powerful, it is unclear if humans can fully understand the "mind" and logic of artificial neural networks.
EC8553 Discrete time signal processing – ssuser2797e4
This document contains a 10 question, multiple choice exam on discrete time signal processing. It covers topics like the discrete Fourier transform (DFT), finite word length effects, fixed point vs floating point representation, and FIR filter design. Specifically, it includes questions that calculate the 4 point DFT of a sequence, define twiddle factors, compare DIT and DIF FFT algorithms, and discuss stability and causality of systems.
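The 4-point DFT question mentioned above can be checked with a direct implementation; the input sequence [1, 2, 3, 4] is an illustrative example, not necessarily the one on the exam.

```python
import cmath

def dft(x):
    """Direct N-point DFT: X[k] = sum_n x[n] * W_N^(k*n),
    with twiddle factor W_N = exp(-j*2*pi/N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]

X = dft([1, 2, 3, 4])
# For N = 4 the twiddle factor is -j, so the DFT is
# X = [10, -2+2j, -2, -2-2j].
```

Note the conjugate symmetry X[3] = X[1]* that holds for any real input, a property such exams often ask students to verify.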
COMPARATIVE STUDY ON BENDING LOSS BETWEEN DIFFERENT S-SHAPED WAVEGUIDE BENDS ... – cscpconf
Bending loss in the waveguide, as well as the leakage and absorption losses, along with a comparative study among different types of S-shaped bend structures, has been computed with the help of a simple matrix method. This method needs only simple 2×2 matrix multiplication. The effective-index profile of the bent waveguide is first transformed to an equivalent straight waveguide with the help of a suitable mapping technique and is partitioned into a large number of thin sections of different refractive indices. The transfer matrix of two adjacent layers is a 2×2 matrix relating the field components in the adjacent layers, and the total transfer matrix is obtained by multiplying all these transfer matrices. The excitation efficiency of the wave in the guiding layer shows a Lorentzian profile, and the power attenuation coefficient of the bent waveguide is the full-width-half-maximum (FWHM) of this peak. The transition losses and pure bending losses can then be computed from these FWHM data. The computation technique is quite fast and is applicable to any waveguide with different parameters and wavelengths of light for both polarizations (TE and TM).
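The 2×2 matrix chaining at the heart of the method can be sketched as follows; the uniform phase-advance layers in the demo are an illustrative stand-in for the per-section matrices built from the effective-index profile.

```python
import cmath

def matmul2(A, B):
    """Product of two 2x2 matrices stored as ((a, b), (c, d))."""
    return ((A[0][0] * B[0][0] + A[0][1] * B[1][0],
             A[0][0] * B[0][1] + A[0][1] * B[1][1]),
            (A[1][0] * B[0][0] + A[1][1] * B[1][0],
             A[1][0] * B[0][1] + A[1][1] * B[1][1]))

def total_transfer_matrix(layers):
    """Chain per-layer 2x2 transfer matrices into the total transfer matrix."""
    M = ((1 + 0j, 0j), (0j, 1 + 0j))          # identity
    for T in layers:
        M = matmul2(T, M)                      # later layers multiply on the left
    return M

# Five identical thin sections, each advancing the field phase by 0.1 rad.
phase = cmath.exp(1j * 0.1)
layer = ((phase, 0j), (0j, 1 / phase))
M = total_transfer_matrix([layer] * 5)
# Diagonal layers compose to a total phase advance of 0.5 rad.
```

In the actual method the off-diagonal entries couple forward and backward field components at each index step, and the Lorentzian excitation efficiency is read off from the chained result.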
The document provides an overview of backpropagation for neural networks. It begins by defining the loss function and discussing gradient descent. It then walks through the computational graph of a simple perceptron and derives the gradients for each operation using the chain rule. This allows computing the gradient of the loss with respect to the weights and biases, which are then updated using gradient descent. It discusses computing gradients for different activation functions like sigmoid, ReLU, and max pooling. Finally, it notes that backpropagation allows estimating parameters across stacked neural network layers.
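The chain-rule walk-through described above can be condensed into a single-neuron example; the sigmoid activation, cross-entropy loss, and learning rate here are illustrative choices.

```python
import math

def forward_backward(x, w, b, t):
    """One sigmoid neuron with cross-entropy loss; returns loss, dL/dw, dL/db.
    Chain rule: dL/dz = y - t, because the sigmoid derivative cancels
    against the cross-entropy derivative."""
    z = w * x + b
    y = 1.0 / (1.0 + math.exp(-z))                  # sigmoid activation
    loss = -(t * math.log(y) + (1 - t) * math.log(1 - y))
    dz = y - t                                      # gradient at the pre-activation
    return loss, dz * x, dz                         # dL/dw, dL/db by the chain rule

# One gradient-descent step on a single training example.
w, b, lr = 0.5, 0.0, 0.1
loss0, dw, db = forward_backward(x=1.0, w=w, b=b, t=1.0)
w, b = w - lr * dw, b - lr * db
loss1, _, _ = forward_backward(x=1.0, w=w, b=b, t=1.0)
```

Stacking layers repeats exactly this pattern: each layer passes dL/dz backward, and every parameter's gradient is the local input times that incoming signal.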
This document provides an overview of convolutional neural networks (CNNs) and their applications. It discusses the common layers in a CNN like convolutional layers, pooling layers, and fully connected layers. It also covers hyperparameters for convolutional layers like filter size and stride. Additional topics summarized include object detection algorithms like YOLO and R-CNN, face recognition models, neural style transfer, and computational network architectures like ResNet and Inception.
This document discusses load flow analysis and loss allocation methods for unbalanced radial power distribution systems. The objectives are to develop a fast three-phase load flow method and an active loss allocation scheme for unbalanced distribution networks. It presents a proposed load flow method based on a forward/backward sweep approach with a new bus identification and multiphase data handling scheme. Test results on sample systems show the proposed method has fewer iterations and faster computation time compared to other established methods.
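The forward/backward sweep idea can be illustrated on a single-phase radial feeder; the two-branch line, per-unit impedances, and loads below are made-up illustrative data, not the paper's multiphase test systems.

```python
def backward_forward_sweep(v_source, z, s_load, iters=20):
    """Sketch of a forward/backward sweep on a radial feeder:
    bus 0 is the source; branch k connects bus k to bus k+1."""
    n = len(s_load)                    # number of load buses
    v = [v_source] * (n + 1)           # flat-start voltage profile
    for _ in range(iters):
        # backward sweep: load currents, then accumulate branch currents
        i_load = [(s_load[k] / v[k + 1]).conjugate() for k in range(n)]
        i_branch = [sum(i_load[k:]) for k in range(n)]
        # forward sweep: drop voltage across each branch impedance
        for k in range(n):
            v[k + 1] = v[k] - z[k] * i_branch[k]
    return v

v = backward_forward_sweep(
    v_source=1.0 + 0j,
    z=[0.01 + 0.02j, 0.01 + 0.02j],        # per-unit branch impedances
    s_load=[0.1 + 0.05j, 0.1 + 0.05j],     # per-unit complex loads
)
```

The voltage magnitude falls monotonically along the feeder, and the fixed-point iteration converges in a handful of sweeps, which is why such methods are fast on radial networks.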
Similar to [MIRU2018] Attention Branch Network using the properties of Global Average Pooling (20)
Keeping up with recent research trends – focusing on Deep Learning – Hiroshi Fukui
This document summarizes key developments in deep learning for object detection from 2012 onwards. It begins with a timeline showing that 2012 was a turning point, as deep learning achieved record-breaking results in image classification. The document then provides overviews of 250+ contributions relating to object detection frameworks, fundamental problems addressed, evaluation benchmarks and metrics, and state-of-the-art performance. Promising future research directions are also identified.
This document discusses non-local neural networks, which use non-local operations to capture long-range dependencies in data. The non-local operation computes the response at a position as a weighted sum of the features at all positions. Adding non-local blocks to existing models leads to improved performance on video classification tasks without increasing parameters or FLOPs significantly. Experimental results show that non-local operations are complementary to 3D convolutions and help models better capture long-range dependencies in space and time.
CVPR2016 was held in Las Vegas from June 26-July 1. The author attended and reported on trends in papers presented. Deep learning and CNNs were widely used for tasks like object detection, segmentation, pose estimation and re-identification. Fast/Faster R-CNN models were common for object detection. CNNs combined with CRFs were frequent for segmentation. Papers on 3D object detection from RGB-D data and dense 3D correspondence between human bodies using CNNs were highlighted.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
The debris of the ‘last major merger’ is dynamically young – Sérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
ESR spectroscopy in liquid food and beverages.pptx – PRIYANKA PATEL
With an increasing population, people need to rely on packaged foodstuffs. Packaging of food materials requires the preservation of food. There are various methods for treating food to preserve it, and irradiation treatment is one of them. It is the most common and most harmless method of food preservation, as it does not alter the necessary micronutrients of the food. Although irradiated food does not cause any harm to human health, quality assessment of the food is still required to provide consumers with the necessary information about it. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during its processing. The ESR spin-trapping technique is useful for the detection of highly unstable radicals in food. The assessment of the antioxidant capability of liquid food and beverages is mainly performed by the spin-trapping technique.
The cost of acquiring information by natural selection – Carl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
Authoring a personal GPT for your research and practice: How we created the Q... – Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx – MAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation, makes them the most convenient, least labor-intensive live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
The binding of cosmological structures by massless topological defects – Sérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or a modified gravity theory is mitigated, at least in part.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige... – University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Immersive Learning That Works: Research Grounding and Paths Forward – Leonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita... – Advanced-Concepts-Team
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of the Moon and artificial satellites
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
AUTHOR(S): LEARNING OF OCCLUSION-AWARE ATTENTION FOR PEDESTRIAN DETECTION

…tion, outputting the classification scores using global average pooling or global max pooling from the feature map f(·). Global average pooling raises the response of the entire feature map for a specific class, because it averages all pixels of a feature map; global max pooling does not raise the entire feature map for a specific class, because it uses only the maximum pixel value in a feature map. The response score for each class under global average pooling and global max pooling is calculated as in Eq. (1):

v_i^c = \begin{cases} \frac{1}{M \times N} \sum_{m=1}^{M} \sum_{n=1}^{N} f^c_{m,n}(x_i) & \text{(global average pooling)} \\ \max_{m,n} f^c_{m,n}(x_i) & \text{(global max pooling)} \end{cases}    (1)

After outputting the score for each class, the attention maps of the pedestrian and occlusion regions are generated. First, we fuse the multi-channel feature map into one channel. In this work, we evaluate the three fusion types shown in Fig. 1(b)–(d): 1) standard fusion, 2) softmax-weighting fusion, and 3) squeeze-and-excitation (SE) block fusion. Standard fusion is simply the summation of the feature maps. In softmax weighting, the feature maps are weighted channel-wise by the softmax scores of Eq. (2); this can mask unnecessary channel feature maps. In SE block fusion, the feature maps are weighted for each channel using the attention of an SE block, as in Squeeze-and-Excitation Networks. After fusing to one channel, the pedestrian classification and occlusion state attentions are combined: we calculate the attention by subtracting the occlusion attention from the pedestrian classification attention. We call the result an attention map because it contains both positive and negative values.

\text{Attention}_i = \sum_{c=1}^{C} f^c(x_i) \ast \frac{\exp(v_i^c)}{\sum_{j=1}^{J} \exp(v_i^j)}    (2)

3.4 Perception branch

The perception branch outputs the final result score using the attention map and the feature map from RoI pooling. The attention map can refine the RoI-pooled feature map, for example by masking unnecessary background features and enhancing important locations. The converted feature map is the inner product of the attention map and the feature map from RoI pooling. The perception branch is composed of two fully connected layers, as in Fast R-CNN. The structure of the perception branch is the same as in conventional Fast R-CNN; however, our model e…
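Eqs. (1) and (2) can be checked numerically with a small sketch; the channel count, 5×5 spatial size, random feature map, and the max-subtraction for numerical stability are illustrative assumptions, not details from the paper.

```python
import numpy as np

def class_scores(feature_map, mode="avg"):
    """Eq. (1): per-class response v_i^c via global average or global max
    pooling; feature_map has shape (C, M, N)."""
    if mode == "avg":
        return feature_map.mean(axis=(1, 2))
    return feature_map.max(axis=(1, 2))

def softmax_weighted_fusion(feature_map, v):
    """Eq. (2): fuse C channel maps into one map, each channel weighted by
    the softmax of its class score."""
    w = np.exp(v - v.max())        # max-subtraction for numerical stability
    w /= w.sum()
    return np.tensordot(w, feature_map, axes=1)   # sum_c w_c * f^c

rng = np.random.default_rng(0)
f = rng.random((4, 5, 5))            # C=4 channels of a 5x5 feature map
v = class_scores(f, mode="avg")
attention = softmax_weighted_fusion(f, v)
```

Since the softmax weights sum to one, each pixel of the fused map is a convex combination of the channel values there, which is why this fusion can suppress unnecessary channels without changing the value range.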
vc
i =
1
M×N ∑M
m=1 ∑N
n=1 fc
m,n (xi) (global average pooling),
max fc
m,n (xi) (global max pooling),
(1)
After outputting the score for each class, the attentions of the pedestrian and occlusion regions are generated. First, we fuse the multi-channel feature map into one channel. In this work, we evaluate three types of fusion, shown in Fig. 1(b)–(d): 1) standard fusion, 2) softmax-weighting fusion, and 3) squeeze-and-excitation (SE) block fusion. Standard fusion is simply the summation of the feature maps. Softmax-weighting fusion weights the feature map of each channel by its softmax score from Eq. (2); this can mask unnecessary channel feature maps. SE block fusion weights the feature map of each channel using the attention of an SE block, as in Squeeze-and-Excitation Networks. After fusing to one channel, the pedestrian classification and occlusion state attentions are combined: we calculate the attention by subtracting the occlusion attention from the pedestrian classification attention. We call the result the attention map because it contains both positive and negative values.
Attention_i = \sum_{c=1}^{C} f^c(x_i) \cdot \frac{\exp(v_i^c)}{\sum_{j=1}^{J} \exp(v_i^j)}    (2)
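Eq. (2) and the fusion variants above can be sketched as follows (a NumPy illustration under our own naming; `feat` is a (C, H, W) channel map and `v` the per-class scores of Eq. (1); the SE-block variant is omitted because it requires learned weights):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def standard_fusion(feat):
    # Standard fusion: simple summation over the C channel maps.
    return feat.sum(axis=0)

def softmax_weighting_fusion(feat, v):
    # Eq. (2): weight each channel map f^c(x_i) by the softmax of its
    # class score v_i^c, then sum over channels.
    w = softmax(v)
    return (feat * w[:, None, None]).sum(axis=0)

def attention_map(ped_feat, ped_v, occ_feat, occ_v):
    # Pedestrian classification attention minus occlusion attention;
    # the result can contain positive and negative values.
    return (softmax_weighting_fusion(ped_feat, ped_v)
            - softmax_weighting_fusion(occ_feat, occ_v))
```

With uniform scores the softmax weights are all 1/C, so the fused map reduces to the channel average; with a peaked score vector, low-scoring channels are effectively masked.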
AUTHOR(S): LEARNING OF OCCLUSION-AWARE ATTENTION FOR PEDESTRIAN DETECTION
The classification scores are output using global average pooling (GAP) or global max pooling (GMP) on the feature map f(·). However, GAP raises the response value over the entire feature map for a specific class, because it uses the average of all pixels of a feature map. GMP, on the other hand, does not raise the entire feature map for a specific class, because it uses only the maximum pixel value in a feature map. The response score of each class under GAP and GMP is calculated as in Eq. (1).
v_i^c = \frac{1}{M \times N} \sum_{m=1}^{M} \sum_{n=1}^{N} f^c_{m,n}(x_i)   (global average pooling)
v_i^c = \max_{m,n} f^c_{m,n}(x_i)   (global max pooling)    (1)
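A small NumPy sketch of Eq. (1) makes the difference concrete (array shapes are illustrative assumptions):

```python
import numpy as np

def gap_score(feat):
    # v_i^c = (1/(M*N)) * sum_{m,n} f^c_{m,n}(x_i): average over the spatial dims
    return feat.mean(axis=(1, 2))

def gmp_score(feat):
    # v_i^c = max_{m,n} f^c_{m,n}(x_i): maximum over the spatial dims
    return feat.max(axis=(1, 2))

feat = np.zeros((3, 4, 4))   # (C, M, N) feature map, hypothetical sizes
feat[0, 2, 2] = 8.0          # one strong localized activation in channel 0
print(gap_score(feat)[0])    # 0.5: GAP spreads the response over all 16 pixels
print(gmp_score(feat)[0])    # 8.0: GMP keeps only the peak value
```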
How Small Network Can Detect Ped…
Anonymous CVPR submission
Paper ID ****

Abstract

1. Introduction

t \log y + (1 − t) \log(1 − y)    (1)
v_i^c = \frac{1}{M \times N} \sum_{m=1}^{M} \sum_{n=1}^{N} f^c_{m,n}(x_i)    (2)
M, N    (3)
C    (4)
6. Table 1. Classification error on the ILSVRC validation set.
Networks top-1 val. error top-5 val. error
VGGnet-GAP 33.4 12.2
GoogLeNet-GAP 35.0 13.2
AlexNet∗-GAP 44.9 20.9
AlexNet-GAP 51.1 26.3
GoogLeNet 31.9 11.3
VGGnet 31.2 11.4
AlexNet 42.6 19.5
NIN 41.9 19.6
GoogLeNet-GMP 35.6 13.9
Table 2. Localization error on the ILSVRC validation set. Backprop refers to using [23] for localization instead of CAM.
Method top-1 val.error top-5 val. error
GoogLeNet-GAP 56.40 43.00
VGGnet-GAP 57.20 45.14
GoogLeNet 60.09 49.34
AlexNet∗-GAP 63.75 49.53
AlexNet-GAP 67.19 52.16
NIN 65.47 54.19
Backprop on GoogLeNet 61.31 50.55
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie¹, Ross Girshick², Piotr Dollár², Zhuowen Tu¹, Kaiming He²
¹UC San Diego   ²Facebook AI Research
{s9xie,ztu}@ucsd.edu   {rbg,pdollar,kaiminghe}@fb.com
Abstract
We present a simple, highly modularized network archi-
tecture for image classification. Our network is constructed
by repeating a building block that aggregates a set of trans-
formations with the same topology. Our simple design re-
sults in a homogeneous, multi-branch architecture that has
only a few hyper-parameters to set. This strategy exposes a
new dimension, which we call “cardinality” (the size of the
set of transformations), as an essential factor in addition to
the dimensions of depth and width. On the ImageNet-1K
dataset, we empirically show that even under the restricted
condition of maintaining complexity, increasing cardinality
is able to improve classification accuracy. Moreover, in-
creasing cardinality is more effective than going deeper or
wider when we increase the capacity. Our models, named
ResNeXt, are the foundations of our entry to the ILSVRC
2016 classification task in which we secured 2nd place.
We further investigate ResNeXt on an ImageNet-5K set and
the COCO detection set, also showing better results than
its ResNet counterpart. The code and models are publicly
available online.
1. Introduction
[Figure 1 diagram. ResNet block: 256-d in → (256, 1x1, 64) → (64, 3x3, 64) → (64, 1x1, 256) → sum → 256-d out. ResNeXt block: 32 parallel paths, each 256-d in → (256, 1x1, 4) → (4, 3x3, 4) → (4, 1x1, 256), summed → 256-d out.]
Figure 1. Left: A block of ResNet [14]. Right: A block of
ResNeXt with cardinality = 32, with roughly the same complex-
ity. A layer is shown as (# in channels, filter size, # out channels).
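As a quick check of the caption's "roughly the same complexity" claim, the weight counts of the two blocks in Figure 1 can be compared (biases and batch-norm parameters ignored; this arithmetic is ours, not from the paper):

```python
def conv_params(c_in, k, c_out):
    # Parameter count of a k x k convolution, biases ignored.
    return c_in * k * k * c_out

# ResNet bottleneck block: 256 -> 64 -> 64 -> 256
resnet = (conv_params(256, 1, 64)
          + conv_params(64, 3, 64)
          + conv_params(64, 1, 256))

# ResNeXt block: 32 paths (cardinality), each 256 -> 4 -> 4 -> 256
path = (conv_params(256, 1, 4)
        + conv_params(4, 3, 4)
        + conv_params(4, 1, 256))
resnext = 32 * path

print(resnet)   # 69632
print(resnext)  # 70144
```

The two counts differ by less than 1%, which is why cardinality can be traded off against depth and width at fixed complexity.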
ing blocks of the same shape. This strategy is inherited
by ResNets [14] which stack modules of the same topol-
ogy. This simple rule reduces the free choices of hyper-
parameters, and depth is exposed as an essential dimension
in neural networks. Moreover, we argue that the simplicity
of this rule may reduce the risk of over-adapting the hyper-
parameters to a specific dataset. The robustness of VGG-
nets and ResNets has been proven by various visual recog-
nition tasks [7, 10, 9, 28, 31, 14] and by non-visual tasks
involving speech [42, 30] and language [4, 41, 20].
Unlike VGG-nets, the family of Inception models [38,
17, 39, 37] have demonstrated that carefully designed
arXiv:1611.05431v2 [cs.CV] 11 Apr 2017
Densely Connected Convolutional Networks
Gao Huang* (Cornell University) gh349@cornell.edu
Zhuang Liu* (Tsinghua University) liuzhuang13@mails.tsinghua.edu.cn
Laurens van der Maaten (Facebook AI Research) lvdmaaten@fb.com
Kilian Q. Weinberger (Cornell University) kqw4@cornell.edu
Abstract
Recent work has shown that convolutional networks can
be substantially deeper, more accurate, and efficient to train
if they contain shorter connections between layers close to
the input and those close to the output. In this paper, we
embrace this observation and introduce the Dense Convo-
lutional Network (DenseNet), which connects each layer
to every other layer in a feed-forward fashion. Whereas
traditional convolutional networks with L layers have L
connections—one between each layer and its subsequent
layer—our network has L(L+1)/2 direct connections. For
each layer, the feature-maps of all preceding layers are
used as inputs, and its own feature-maps are used as inputs
into all subsequent layers. DenseNets have several com-
pelling advantages: they alleviate the vanishing-gradient
problem, strengthen feature propagation, encourage fea-
ture reuse, and substantially reduce the number of parame-
ters. We evaluate our proposed architecture on four highly
competitive object recognition benchmark tasks (CIFAR-10,
CIFAR-100, SVHN, and ImageNet). DenseNets obtain sig-
nificant improvements over the state-of-the-art on most of
them, whilst requiring less computation to achieve high per-
formance. Code and pre-trained models are available at
https://github.com/liuzhuang13/DenseNet.
1. Introduction
Convolutional neural networks (CNNs) have become
the dominant machine learning approach for visual object
recognition. Although they were originally introduced over
20 years ago [18], improvements in computer hardware and
network structure have enabled the training of truly deep
CNNs only recently. The original LeNet5 [19] consisted of
5 layers, VGG featured 19 [29], and only last year Highway
*Authors contributed equally
Figure 1: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input.
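The dense connectivity in Figure 1 can be sketched as follows (a minimal NumPy stand-in; the random 1x1 mixing below replaces the paper's composite function H_l, which is actually BN-ReLU-Conv):

```python
import numpy as np

def dense_block(x0, num_layers=5, growth_rate=4):
    # Each layer receives the concatenation of all preceding feature maps
    # and contributes growth_rate (k) new channels.
    features = [x0]                                   # x0: (C0, H, W)
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)        # all preceding maps
        # Stand-in for H_l: a fixed random 1x1 channel mixing.
        w = np.random.default_rng(0).standard_normal((growth_rate, inp.shape[0]))
        out = np.tensordot(w, inp, axes=1)            # (k, H, W)
        features.append(out)
    return np.concatenate(features, axis=0)

x0 = np.zeros((8, 4, 4))
y = dense_block(x0)
print(y.shape)   # (28, 4, 4): 8 input channels + 5 layers x k=4 new channels
```

Within one block of L layers there are L(L+1)/2 direct connections, since layer l consumes the outputs of all l preceding stages.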
Networks [34] and Residual Networks (ResNets) [11] have
surpassed the 100-layer barrier.
As CNNs become increasingly deep, a new research
problem emerges: as information about the input or gra-
dient passes through many layers, it can vanish and “wash
out” by the time it reaches the end (or beginning) of the
network. Many recent publications address this or related
problems. ResNets [11] and Highway Networks [34] by-
pass signal from one layer to the next via identity connec-
tions. Stochastic depth [13] shortens ResNets by randomly
dropping layers during training to allow better information
and gradient flow. FractalNets [17] repeatedly combine sev-
eral parallel layer sequences with different number of con-
volutional blocks to obtain a large nominal depth, while
maintaining many short paths in the network. Although
these different approaches vary in network topology and
training procedure, they all share a key characteristic: they
create short paths from early layers to later layers.
arXiv:1608.06993v5 [cs.CV] 28 Jan 2018