JOINT DEEP EXPLOITATION OF SEMANTIC KEYWORDS AND VISUAL FEATURES FOR
MALICIOUS CROWD IMAGE CLASSIFICATION
Joel Levis1, Hyungtae Lee2,3, Heesung Kwon3, James Michaelis3, Michael Kolodny3, and Sungmin Eum3,4
1 Ohio University, Athens, Ohio, U.S.A.
2 Booz Allen Hamilton Inc., McLean, Virginia, U.S.A.
3 U.S. Army Research Laboratory, Adelphi, Maryland, U.S.A.
4 University of Maryland, College Park, Maryland, U.S.A.
jl359113@ohio.edu, lee hyungtae@bah.com, heesung.kwon.civ@mail.mil,
james.r.michaelis2.civ@mail.mil, michael.a.kolodny.ctr@mail.mil, smeum@umiacs.umd.edu
ABSTRACT
General image classification approaches differentiate classes using strong distinguishing features, but some classes cannot be easily separated because their visual features are very similar. To address this problem, we exploit keywords relevant to a particular class. To implement this concept, we constructed a new malicious crowd dataset containing crowd images of two events, benign and malicious, which look similar yet involve opposite semantics. We also created a set of five keywords relevant to the malicious event, such as police and fire. In the evaluation, integrating malicious event classification with the recognition output of these keywords enhances the overall performance on the malicious crowd dataset.
Index Terms— malicious crowd dataset, semantic keyword, image classification
1. INTRODUCTION
General image classification methods have drawn upon the fact that images of differing classes exhibit strongly distinguishing features [1, 2, 3, 4]. However, certain classes involve very different events yet can be represented by very similar image features, such as the objects that mainly appear in the associated images. For example, the two images in Figure 1 seem to depict a similar event because people are prominent in both. We can discern, however, that the two images involve opposite semantic events: benign and malicious. The right image is malicious due to several telling objects, such as smoke and police equipment. General image classification may not perform well without semantically crucial object information, which may or may not be visually prominent in the image but can still serve as important keywords for inferring which event occurs. We address this problem by identifying semantically unique keywords that occur with higher frequency among the malicious images, and use these identified words to improve classification accuracy.
Fig. 1. A pair of similar-looking crowd images with distinct object contents. Relevant keywords for the benign image (left): street, store, sign, flower, people. Relevant keywords for the malicious image (right): police, smoke, protest, crowd, fire.
Since most benchmark datasets [5, 6, 7] collected for event classification do not address this problem, we collected a novel “malicious crowd” dataset, which contains crowd images of two events: benign and malicious. Along with the event-level labels, we also collected keywords that describe what appears in each image, as listed below each image in Figure 1. We used Amazon Mechanical Turk to describe the semantic contents of each image in terms of keywords. We then pooled the keywords for both classes and created a set of the most frequently used words for each event. From these, we selected the non-overlapping, distinctive keywords of the malicious event and treat them as the representative “semantic keywords”.
To identify semantic keywords from a test image, we used a well-known detection method, the deformable part model (DPM) [8], and a classification algorithm, a fine-tuned AlexNet [9]. Among the keywords, some objects such as police, helmet, and car have a rigid appearance while others such as fire and smoke do not. DPM was used to detect the objects with rigid appearance, whereas the fine-tuned AlexNet was employed to detect the less rigid objects such as smoke and fire. We also built an additional fine-tuned AlexNet architecture to classify benign/malicious crowd images. Finally, we used several late fusion approaches to integrate the malicious crowd image classification result with the keyword detection/classification results. Our experiments show that the fusion of image and keyword classifications outperforms image classification alone. This supports the effectiveness of exploiting semantic keywords relevant to the malicious crowd images.

Fig. 2. Malicious Crowd Dataset: (a) example images for the benign and malicious events are shown in the first and second rows, respectively. (b) keywords mainly seen in the images of each class; red keywords are relevant keywords for the malicious event.

category | keywords
benign | crowd, people, city, building, men, women, group, road, sidewalk, sign, race, tree, event, fans, gathering, . . .
malicious | crowd, people, protest, police, fire, street, riot, city, building, smoke, men, sign, flag, night, man, helmet, signs, group, violence, car, . . .
Our contributions are summarized as follows:
1. We introduce a new image classification task in which classes cannot be easily separated from each other, unlike in general image classification.
2. To deal with this problem, we collect a malicious crowd dataset consisting of two classes, malicious and benign crowds, which look similar but contain opposite semantic events.
3. We exploit semantic keywords relevant only to malicious crowd images to differentiate the malicious crowd images from the benign ones.
4. Integrating image features with this semantic keyword information increases image classification accuracy on the malicious crowd dataset.
2. MALICIOUS CROWD DATASET AND SEMANTIC
KEYWORDS
2.1. Malicious Crowd Dataset
The “malicious crowd” dataset used to test our hypothesis contains 1133 crowd images roughly equally split into two classes: benign and malicious. The intuition behind the labeling was that a benign crowd is something a passerby would not be alarmed or concerned to see, while a malicious crowd would be alarming and potentially dangerous.
Fig. 3. Histograms of relevant keywords: the left and right histograms show the frequency of appearance (%) of the keywords relevant to the benign and malicious classes, respectively. The keywords are listed according to their frequency of appearance in the images.
These images were gathered from Google Images using various search terms. For benign images, search terms such as marathon, pedestrian crowd, parade, and concert were used; riot and protest were used to gather the malicious crowd images. Figure 2(a) illustrates some example images from each class.
2.2. Semantic Keywords
To describe the contents of each crowd image, Amazon Mechanical Turk was used: a human annotator assigned five keywords to each image based on the objects observed within it. To ensure the accuracy of the Mechanical Turk results, we manually removed the keywords that were incorrectly assigned.
After collecting the crowd images and the corresponding keywords, we needed to identify the keywords relevant only to the malicious class. We constructed two keyword sets, each acquired by selecting the most frequently appearing keywords in the given class: in practice, words annotated in 5% or more of the images of a class were selected. This thresholding yields 17 and 20 words for the benign and malicious classes, respectively. The selected words and their frequencies for both classes can be seen in Figure 3. We refined the malicious keyword set by eliminating the keywords that appear in both classes, which resulted in nine malicious keywords, shown in red in Figure 2(b). Lastly, we further eliminated keywords indicating particular phenomena, such as protest, riot, night, and violence. This left police, fire, smoke, helmet, and car as the final set of malicious semantic keywords.

Table 1. Number of images where each keyword relevant to the malicious event appears.

class | images | police | fire | smoke | helmet | car
benign | 557 | 8 | 1 | 2 | 7 | 57
malicious | 576 | 205 | 144 | 150 | 206 | 65
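The selection procedure above (per-class frequency counting, a 5% threshold, and removal of keywords shared between classes) can be sketched as follows; the function name and the toy tag lists are hypothetical:

```python
from collections import Counter

def select_semantic_keywords(benign_tags, malicious_tags, min_freq=0.05):
    """Keep keywords frequent in the malicious class but absent from
    the frequent-benign set (a sketch of the selection in Sec. 2.2)."""
    def frequent(tag_lists):
        # Count each keyword once per image, then threshold by the
        # fraction of images it appears in.
        counts = Counter(kw for tags in tag_lists for kw in set(tags))
        n = len(tag_lists)
        return {kw for kw, c in counts.items() if c / n >= min_freq}

    return frequent(malicious_tags) - frequent(benign_tags)

# Toy example: each inner list holds one image's MTurk keywords.
benign = [["crowd", "people", "street"], ["crowd", "sign", "people"]]
malicious = [["crowd", "police", "fire"], ["police", "smoke", "crowd"]]
print(select_semantic_keywords(benign, malicious))  # {'police', 'fire', 'smoke'}
```

The final manual step in the paper (dropping phenomenon words like protest and riot) is a judgment call and is not automated here.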
Table 1 shows the number of images in which each keyword (object) actually appears. While police, fire, smoke, and helmet seem to be closely associated with the malicious event, car is seen in both events with similar frequency. Note that the numbers in the table do not necessarily match the histogram of malicious semantic keywords obtained from Amazon Mechanical Turk. For example, police appears in 205 of the 576 malicious images (35.59%) but is assigned to only 28.50% of the malicious images by Amazon Mechanical Turk. This is because the visual content associated with these keywords is not very prominent in several images. We can observe that the frequencies of the selected semantic keywords show a notable gap between the two classes, indicating that the purpose of the proposed keyword selection process is achieved.
3. THE PROPOSED APPROACH
To identify semantic keywords in the test images, keyword detectors/classifiers were trained. For objects with a rigid appearance, such as police, helmet, and car, deformable part models (DPM) [8] were trained. For fire and smoke, objects with a non-rigid appearance, convolutional neural network (CNN) classifiers fine-tuned from the AlexNet architecture [9] were used. Since the object detectors output multiple detections per image, we select the detection with the maximum score and use that score to represent the confidence of the object's presence in the image. We also built a CNN classifier that outputs a confidence score for the maliciousness of an image. Multiple late fusion approaches were utilized to combine the outputs of all keyword detectors/classifiers and the malicious image classifier.
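The max-score pooling described above can be sketched as follows; the function, its input format, and the default score for images with no detections are illustrative assumptions:

```python
def image_level_score(detections):
    """Collapse per-image detections, given as (bbox, score) pairs,
    into a single confidence: the maximum detection score.
    Images where the detector fires nowhere get -inf (an assumed
    sentinel, i.e. lowest possible confidence)."""
    if not detections:
        return float("-inf")
    return max(score for _, score in detections)

# Two hypothetical police detections in one image; the image-level
# confidence is the stronger of the two.
dets = [((10, 20, 50, 80), 0.31), ((90, 40, 140, 120), 0.87)]
print(image_level_score(dets))  # 0.87
```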
3.1. Learning Keyword Detectors
The DPM detectors used to identify police and helmet were trained on 400 annotated auxiliary images collected from Google Images. For the car detector, we used the DPM trained on the PASCAL VOC 2007 dataset [10].
3.2. Learning Malicious Event/Keyword Classifiers
First, an AlexNet deep convolutional neural network (DCNN) was fine-tuned to classify images as benign or malicious. The training set includes 905 images randomly selected from the malicious crowd dataset. Fine-tuning was conducted on all eight layers of AlexNet, with a learning rate of 20 for the eighth layer and 2 for all others. The last layer was replaced so as to produce a binary output instead of the 1000-class output of AlexNet.
The fire and smoke DCNN-based classifiers were trained in a similar way. Each of these models was trained on 300 images, drawn from our dataset and from auxiliary images gathered from Google Images. We used separate networks for the two keywords instead of one network with multiple labels because both keywords may appear in the same training image.
3.3. Late Fusion
Late fusion was performed on the outputs of six streams: the malicious crowd image classifier, the three detectors for police, helmet, and car, and the two classifiers for fire and smoke. The late fusion enhances the baseline classifier on the premise that the additional object information helps increase classification accuracy. To test which fusion method would be most effective, the streams were combined using various fusion methods: Linear Discriminant Analysis (LD) [11], Logistic Regression (LR) [12], Support Vector Machines (SVM) [13], k-Nearest Neighbor classifiers (kNN) [14], Subspace-based Ensemble Classifiers (EC) [15], and Dynamic Belief Fusion (DBF) [16]. For SVM, we used two different kernels: a linear kernel (SVM-lin) and an RBF kernel (SVM-rbf). For kNN, we used 100 nearest neighbors under the Euclidean distance. As the EC, we used a subspace ensemble classifier with a set of 30 weak models.
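The six-stream fusion can be sketched with one of the listed fusers (logistic regression); the data below is synthetic and only illustrates the shapes involved — in the paper, each column would hold one stream's confidence scores on the training images:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, n)  # 0 = benign, 1 = malicious (toy labels)

# Toy stand-in for the six per-image stream scores (malicious-crowd
# classifier; police/helmet/car detectors; fire/smoke classifiers),
# loosely correlated with the label so the fuser has signal to learn.
streams = labels[:, None] * 0.5 + rng.random((n, 6)) * 0.5

# Train the fuser on the stacked stream scores; LR is one of the
# fusion methods compared in Sec. 3.3.
fuser = LogisticRegression().fit(streams, labels)
fused = fuser.predict_proba(streams)[:, 1]  # fused maliciousness score
```

Swapping `LogisticRegression` for an SVM, kNN, or subspace ensemble gives the other fusers in the comparison; DBF [16] follows its own formulation.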
4. EXPERIMENTS
4.1. Dataset Partition and Evaluation Protocol
The malicious crowd dataset consists of 1133 images, 576 of which are labeled as malicious crowd images and the rest as benign. The same training set mentioned in Section 3.2 (905 images) is used to train the fusion approaches; the remaining 228 images are used as the test set.
Fig. 4. Output of malicious crowd image classification and keyword detectors/classifiers: the first and second rows show the four examples with the largest fusion scores for the malicious event among the malicious and benign crowd images, respectively. Bounding boxes in red, green, and blue indicate detections by the police, helmet, and car detectors, respectively. The late fusion scores are obtained by EC (a subspace ensemble classifier).
Table 2. Malicious crowd image classification accuracy measured by AP.

     |          | keyword                          | late fusion
     | baseline | police fire  smoke helmet car    | SVM-rbf DBF   SVM-lin kNN   LD    LR    EC
AP   | .722     | .586   .563  .689  .532   .491   | .742    .757  .758    .758  .760  .763  .771
Gain | ·        | ·      ·     ·     ·      ·      | +.020   +.035 +.036   +.036 +.038 +.041 +.049

Average precision (AP) is used as the evaluation metric in our experiments.
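AP can be computed as the mean of the precision values at the rank of each positive (here, malicious) test image when images are sorted by descending score; a minimal sketch:

```python
import numpy as np

def average_precision(labels, scores):
    """AP: rank images by descending score, take the precision at
    each positive's rank, and average over the positives."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(labels)[order]          # labels in ranked order
    hits = np.cumsum(ranked)                    # positives seen so far
    precisions = hits / np.arange(1, len(ranked) + 1)
    return float(np.sum(precisions * ranked) / np.sum(ranked))

# Toy example: two malicious (1) and two benign (0) test images.
# Precisions at the two positives are 1.0 and 2/3, so AP = 5/6.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```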
4.2. Results
Table 2 shows the malicious crowd image classification accuracy in AP for the baseline malicious crowd image classifier, the keyword detectors/classifiers, and the various late fusion approaches. Note that, for each keyword detector/classifier, accuracy was calculated for recognizing the malicious image rather than the associated keyword. For example, if a test image is malicious but contains no police, and the police detector accordingly detects nothing, the result is still counted as a false negative. The car detector does not provide competitive accuracy because, as shown in Table 1, car is not strongly relevant to the malicious crowd. The other keyword detectors also fail to outperform the baseline malicious crowd image classifier, because these semantic keywords (objects) are seen in only a small portion of the dataset. However, integrating the baseline with the outputs of these keyword classifiers/detectors enhanced the classification accuracy by up to approximately 7%. The best performer is EC, the subspace-based ensemble classifier, achieving a fusion gain of .049 AP. All fusion approaches improve classification accuracy over the baseline, which supports the benefit of jointly exploiting semantic keywords and the associated detectors and classifiers. Figure 4 shows several images with high maliciousness scores from both the malicious and benign classes.
5. CONCLUSION
We addressed a new image classification problem in which certain classes can be expressed by similar visual features but must be distinguished from each other semantically. To demonstrate this, we constructed a novel malicious crowd image dataset consisting of two classes (benign and malicious) that may look similar but contain semantically different events. To better classify images with these characteristics, we selected representative keywords for malicious crowd images, which are then incorporated with a conventional image classifier using a multi-stream late fusion architecture. As Table 2 shows, the hypothesized approach led to considerable performance improvements over the conventional baseline classifier.
6. REFERENCES
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hin-
ton, “Imagenet classification with deep convolutional
neural networks,” Advances in Neural Information Pro-
cessing Systems 25, 2012.
[2] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef
Sivic, “Is object localization for free? – weakly-
supervised learning with convolutional neural net-
works,” Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, 2015.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun, “Deep residual learning for image recogni-
tion,” IEEE Conference on Computer Vision and Pattern
Recognition, 2015.
[4] Archith J. Bency, Heesung Kwon, Hyungtae Lee,
S Karthikeyan, and B. S. Manjunath, “Weakly super-
vised localization using deep feature maps,” European
Conference on Computer Vision, 2016.
[5] Li-Jia Li and Li Fei-Fei, “What, where and who? clas-
sifying event by scene and object recognition,” IEEE
International Conference on Computer Vision, 2007.
[6] Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh
Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit
Mukherjee, JK Aggarwal, Hyungtae Lee, Larry Davis,
et al., “A large-scale benchmark dataset for event recog-
nition in surveillance video,” IEEE Conference on Com-
puter Vision and Pattern Recognition, 2011.
[7] George Awad, Jonathan Fiscus, Martial Michel, David
Joy, Wessel Kraaij, Alan F. Smeaton, Georges Quénot,
Maria Eskevich, Robin Aly, and Roeland Ordelman,
“Trecvid 2016: Evaluating video search, video event de-
tection, localization, and hyperlinking,” in Proceedings
of TRECVID 2016. NIST, USA, 2016.
[8] Pedro F. Felzenszwalb, Ross B. Girshick, David
McAllester, and Deva Ramanan, “Object detection
with discriminatively trained part based models,” IEEE
Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 32, no. 9, pp. 1627–1645, 2010.
[9] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef
Sivic, “Learning and transferring mid-level image repre-
sentations using convolutional neural networks,” IEEE
Conference on Computer Vision and Pattern Recogni-
tion, 2014.
[10] Mark Everingham, Luc Van Gool, Christopher
K. I. Williams, John Winn, and Andrew Zisserman,
“The PASCAL Visual Object Classes Challenge
2007 (VOC2007) Results,” http://www.pascal-
network.org/challenges/VOC/voc2007/workshop/index.html.
[11] Ronald Aylmer Fisher, “The use of multiple measure-
ments in taxonomic problems,” Annals of Eugenics, vol.
7, pp. 179–188, 1936.
[12] David A. Freedman, “Statistical models: Theory and
practice,” p. 128. Cambridge University Press, 2009.
[13] C. Cortes and V. Vapnik, “Support-vector networks,”
Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[14] N. S. Altman, “An introduction to kernel and nearest-
neighbor nonparametric regression,” The American
Statistician, vol. 46, no. 3, pp. 175–185, 1992.
[15] Tin Kam Ho, “The random subspace method for con-
structing decision forests,” IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, vol. 20, no. 8,
pp. 832–844, 1998.
[16] Hyungtae Lee, Heesung Kwon, Ryan M. Robinson,
William D. Nothwang, and Amar M. Marathe, “Dy-
namic belief fusion for object detection,” IEEE Winter
Conference on Applications of Computer Vision, 2016.

More Related Content

Similar to levis

A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
IRJET Journal
 
GROUPING OBJECTS BASED ON THEIR APPEARANCE
GROUPING OBJECTS BASED ON THEIR APPEARANCEGROUPING OBJECTS BASED ON THEIR APPEARANCE
GROUPING OBJECTS BASED ON THEIR APPEARANCE
ijaia
 
THE LORE OF SPECULATION AND ANALYSIS USING MACHINE LEARNING AND IMAGE MATCHING
THE LORE OF SPECULATION AND ANALYSIS USING  MACHINE LEARNING AND IMAGE MATCHINGTHE LORE OF SPECULATION AND ANALYSIS USING  MACHINE LEARNING AND IMAGE MATCHING
THE LORE OF SPECULATION AND ANALYSIS USING MACHINE LEARNING AND IMAGE MATCHING
IJTRET-International Journal of Trendy Research in Engineering and Technology
 
A03203001004
A03203001004A03203001004
A03203001004
theijes
 
Tackling_Android_Stego_Apps_in_the_Wild.pdf
Tackling_Android_Stego_Apps_in_the_Wild.pdfTackling_Android_Stego_Apps_in_the_Wild.pdf
Tackling_Android_Stego_Apps_in_the_Wild.pdf
med_univ78
 
Image–based face-detection-and-recognition-using-matlab
Image–based face-detection-and-recognition-using-matlabImage–based face-detection-and-recognition-using-matlab
Image–based face-detection-and-recognition-using-matlab
Ijcem Journal
 
Detection of Fake Accounts in Instagram Using Machine Learning
Detection of Fake Accounts in Instagram Using Machine LearningDetection of Fake Accounts in Instagram Using Machine Learning
Detection of Fake Accounts in Instagram Using Machine Learning
AIRCC Publishing Corporation
 
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNINGDETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
ijcsit
 
Violent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence ClusteringViolent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence Clustering
csandit
 
Violent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence ClusteringViolent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence Clustering
csandit
 
Violent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence ClusteringViolent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence Clustering
csandit
 
Face Recognition Techniques - An evaluation Study
Face Recognition Techniques - An evaluation StudyFace Recognition Techniques - An evaluation Study
Face Recognition Techniques - An evaluation Study
Eswar Publications
 
Faces in the Distorting Mirror: Revisiting Photo-based Social Authentication
Faces in the Distorting Mirror: Revisiting Photo-based Social AuthenticationFaces in the Distorting Mirror: Revisiting Photo-based Social Authentication
Faces in the Distorting Mirror: Revisiting Photo-based Social Authentication
FACE
 
Criminals presentation
Criminals presentationCriminals presentation
Criminals presentation
Annye Braca
 
Image Classification and Annotation Using Deep Learning
Image Classification and Annotation Using Deep LearningImage Classification and Annotation Using Deep Learning
Image Classification and Annotation Using Deep Learning
IRJET Journal
 
Face identification
Face  identificationFace  identification
Face identification
27vipin92
 
The method of comparing two image files
 The method of comparing two image files The method of comparing two image files
The method of comparing two image files
Minh Anh Nguyen
 
The method of comparing two image files
The method of comparing two image filesThe method of comparing two image files
The method of comparing two image files
Minh Anh Nguyen
 
Report
ReportReport
Report
Harsh Parikh
 
Ijarcet vol-2-issue-4-1374-1382
Ijarcet vol-2-issue-4-1374-1382Ijarcet vol-2-issue-4-1374-1382
Ijarcet vol-2-issue-4-1374-1382
Editor IJARCET
 

Similar to levis (20)

A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
 
GROUPING OBJECTS BASED ON THEIR APPEARANCE
GROUPING OBJECTS BASED ON THEIR APPEARANCEGROUPING OBJECTS BASED ON THEIR APPEARANCE
GROUPING OBJECTS BASED ON THEIR APPEARANCE
 
THE LORE OF SPECULATION AND ANALYSIS USING MACHINE LEARNING AND IMAGE MATCHING
THE LORE OF SPECULATION AND ANALYSIS USING  MACHINE LEARNING AND IMAGE MATCHINGTHE LORE OF SPECULATION AND ANALYSIS USING  MACHINE LEARNING AND IMAGE MATCHING
THE LORE OF SPECULATION AND ANALYSIS USING MACHINE LEARNING AND IMAGE MATCHING
 
A03203001004
A03203001004A03203001004
A03203001004
 
Tackling_Android_Stego_Apps_in_the_Wild.pdf
Tackling_Android_Stego_Apps_in_the_Wild.pdfTackling_Android_Stego_Apps_in_the_Wild.pdf
Tackling_Android_Stego_Apps_in_the_Wild.pdf
 
Image–based face-detection-and-recognition-using-matlab
Image–based face-detection-and-recognition-using-matlabImage–based face-detection-and-recognition-using-matlab
Image–based face-detection-and-recognition-using-matlab
 
Detection of Fake Accounts in Instagram Using Machine Learning
Detection of Fake Accounts in Instagram Using Machine LearningDetection of Fake Accounts in Instagram Using Machine Learning
Detection of Fake Accounts in Instagram Using Machine Learning
 
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNINGDETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
 
Violent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence ClusteringViolent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence Clustering
 
Violent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence ClusteringViolent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence Clustering
 
Violent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence ClusteringViolent Scenes Detection Using Mid-Level Violence Clustering
Violent Scenes Detection Using Mid-Level Violence Clustering
 
Face Recognition Techniques - An evaluation Study
Face Recognition Techniques - An evaluation StudyFace Recognition Techniques - An evaluation Study
Face Recognition Techniques - An evaluation Study
 
Faces in the Distorting Mirror: Revisiting Photo-based Social Authentication
Faces in the Distorting Mirror: Revisiting Photo-based Social AuthenticationFaces in the Distorting Mirror: Revisiting Photo-based Social Authentication
Faces in the Distorting Mirror: Revisiting Photo-based Social Authentication
 
Criminals presentation
Criminals presentationCriminals presentation
Criminals presentation
 
Image Classification and Annotation Using Deep Learning
Image Classification and Annotation Using Deep LearningImage Classification and Annotation Using Deep Learning
Image Classification and Annotation Using Deep Learning
 
Face identification
Face  identificationFace  identification
Face identification
 
The method of comparing two image files
 The method of comparing two image files The method of comparing two image files
The method of comparing two image files
 
The method of comparing two image files
The method of comparing two image filesThe method of comparing two image files
The method of comparing two image files
 
Report
ReportReport
Report
 
Ijarcet vol-2-issue-4-1374-1382
Ijarcet vol-2-issue-4-1374-1382Ijarcet vol-2-issue-4-1374-1382
Ijarcet vol-2-issue-4-1374-1382
 

levis

  • 1. JOINT DEEP EXPLOITATION OF SEMANTIC KEYWORDS AND VISUAL FEATURES FOR MALICIOUS CROWD IMAGE CLASSIFICATION Joel Levis1 , Hyungtae Lee23 , Heesung Kwon3 , James Michaelis3 , Michael Kolodny3 , and Sungmin Eum34 1 Ohio University, Athens, Ohio, U.S.A. 2 Booz Allen Hamilton Inc., McLean, Virginia U.S.A. 3 U.S. Army Research Laboratory, Adelphi, Maryland, U.S.A. 4 University of Maryland, College Park, Maryland, U.S.A. jl359113@ohio.edu, lee hyungtae@bah.com, heesung.kwon.civ@mail.mil james.r.michaelis2.civ@mail.mil, michael.a.kolodny.ctr@mail.mil, smeum@umiacs.umd.edu ABSTRACT General image classification approaches differentiate classes using strong distinguishing features but some classes cannot be easily separated because of very similar visual features. To deal with this problem, we can use keywords relevant to a particular class. To implement this concept we have newly constructed a malicious crowd dataset which contains crowd images with two events, benign and malicious, which look similar yet involve opposite semantic events. We also created a set of five malicious event-relevant keywords such as police and fire. In the evaluation, integrating malicious event classi- fication with recognition output of these keywords enhances the overall performance on the malicious crowd dataset. Index Terms— malicious crowd dataset, semantic key- word, image classification 1. INTRODUCTION General image classification methods have drawn upon the fact that images of differing classes have strong distinguish- ing features. [1, 2, 3, 4] However, certain classes involve very different events but can be represented with very similar im- age features, such as objects, that mainly appeared in the asso- ciated images. For example, in Figure 1, two images seem to contain similar event because persons are outstanding in both images. We can discern, however, that the two images involve opposite semantic events, which are benign and malicious. 
The right image is malicious due to several odd objects, such as smoke and police equipment. General image classification may not perform well without semantically crucial object in- formation, which may or may not be notable from the im- age, but can still be important keywords to guess which event occurs. We address this problem by identifying semantically unique keywords, which occur in higher frequency among the malicious images, and use these identified words to improve classification accuracy. benign malicious Relevant keywords street, store, sign, flower, people Relevant keywords police, smoke, protest, crowd, fire Fig. 1. A pair of similar looking crowd images with unique object contents Since most benchmark datasets [5, 6, 7] collected for event classification do not deal with this problem, we col- lected a novel “malicious crowd” dataset, which contains crowd images with two events: benign and malicious. Along with event-level labels, we also collected a number of key- words that appeared in each image in the dataset, as listed below each image in Figure 1. We used Amazon Mechani- cal Turk to describe the semantic contents of each image in terms of keywords. Then we collected all the keywords for both classes and created a set of words used at most for each event. We selected non-overlapping distinctive keywords for the malicious event, which we aim to identify and treat them as the representative “semantic keywords”. To identify semantic keywords from a test image, we used a well known detection method, the deformable part model (DPM) [8], and a classification algorithm, which is a fine- tuned AlexNet [9]. Among various keywords, some such as police, helmet, and car have rigid appearance but the others such as fire and smoke do not. DPM was used to detect the objects with rigid appearance whereas the finetuned AlexNet was employed to detect less rigid objects such as smoke and fire. 
We also built an additional fine-tuned AlexNet architecture to classify benign/malicious crowd images. Finally, we used several late fusion approaches to integrate the malicious crowd image classification result with the keyword detection/classification results. Our experiments show that fusing the image and keyword classifications outperforms using the image classification alone. This supports the effectiveness of exploiting semantic keywords relevant to the malicious crowd images.

Fig. 2. Malicious Crowd Dataset: (a) several example images for the benign and malicious events, shown in the first and second rows, respectively; (b) keywords most often seen in the images of each class. Benign: crowd, people, city, building, men, women, group, road, sidewalk, sign, race, tree, event, fans, gathering, ... Malicious: crowd, people, protest, police, fire, street, riot, city, building, smoke, men, sign, flag, night, man, helmet, signs, group, violence, car, ... Red keywords are the keywords relevant to the malicious event.

Our contributions are summarized as follows:

1. We introduce a new image classification task in which, unlike in general image classification, classes cannot be easily separated from each other.
2. To address this problem, we collect a malicious crowd dataset consisting of two classes, malicious and benign crowds, which look similar but contain opposite semantic events.
3. We exploit semantic keywords relevant only to malicious crowd images to differentiate them from benign ones.
4. Integrating image features with this semantic keyword information increases image classification accuracy on the malicious crowd dataset.

2. MALICIOUS CROWD DATASET AND SEMANTIC KEYWORDS

2.1. Malicious Crowd Dataset

The "malicious crowd" dataset used to test our hypothesis contains 1133 crowd images split nearly evenly into two classes: benign and malicious.
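One dataset entry combines an image, its event-level label, and the five annotator-assigned keywords described in Section 2.2. A minimal sketch of such a record (the field names and released format are assumptions, not the actual dataset schema):

```python
# Illustrative record for one malicious crowd dataset entry; the actual
# released format is not specified in the paper.
from dataclasses import dataclass

@dataclass
class CrowdImage:
    path: str          # image file
    malicious: bool    # event-level label: benign (False) / malicious (True)
    keywords: tuple    # five keywords assigned via Amazon Mechanical Turk

example = CrowdImage(
    path="images/riot_0001.jpg",
    malicious=True,
    keywords=("police", "smoke", "protest", "crowd", "fire"),
)
```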
The intuition behind the labeling of the images was that a benign crowd would be something a passerby would not be alarmed or concerned to see, whereas a malicious image would be alarming and potentially dangerous.

Fig. 3. Histograms of relevant keywords: the left and right histograms show the keywords relevant to the benign and malicious classes, respectively (benign: crowd, people, street, city, building, men, women, group, road, sidewalk, sign, race, tree, event, fans, flag, gathering; malicious: crowd, people, protest, police, fire, street, riot, city, building, smoke, men, sign, flag, night, man, helmet, signs, group, violence, car). The keywords are listed according to their frequency of appearance in the images.

The images were gathered from Google Images using various search terms. For benign images, search terms such as marathon, pedestrian crowd, parade, and concert were used. Riot and protest were used as search terms to gather the malicious crowd images. Figure 2(a) illustrates some example images from each class.

2.2. Semantic Keywords

Amazon Mechanical Turk was used to describe the contents of each of the crowd images. A human annotator was responsible for assigning five keywords to each image based on the objects observed within it. To ensure the accuracy of the Mechanical Turk results, we manually removed keywords that were incorrectly assigned.

After collecting the crowd images and their corresponding keywords, we needed to identify the keywords relevant only to the malicious class. We constructed two keyword sets, each acquired by selecting the most frequently appearing keywords in one of the two classes. In practice, words annotated in 5% or more of the images in each
class were selected. As a result of this thresholding, 17 and 20 words were selected for the benign and malicious classes, respectively. The selected words and their frequencies for both classes are shown in Figure 3. We then refined the malicious keyword set by eliminating the keywords that appear in both classes, which left nine malicious keywords, shown in red in Figure 2(b). Lastly, we further eliminated keywords indicating particular phenomena, such as protest, riot, night, and violence. This left police, fire, smoke, helmet, and car as the final set of malicious semantic keywords.

Table 1. Number of images in which each keyword relevant to the malicious event appears.

class       images   police   fire   smoke   helmet   car
benign         557        8      1       2        7    57
malicious      576      205    144     150      206    65

Table 1 shows the number of images in which each keyword (object) actually appears. While police, fire, smoke, and helmet are closely associated with the malicious event, car appears in both events with similar frequency. Note that the numbers in the table do not necessarily match the histogram of malicious semantic keywords obtained from Amazon Mechanical Turk. For example, police appears in 205 of the 576 malicious images (35.59%), but was assigned by Amazon Mechanical Turk to only 28.50% of the malicious images. This is because the visual contents associated with these keywords are not very prominent in several images. We observe that the frequencies of the selected semantic keywords show a notable gap between the two classes, indicating that the purpose of the proposed keyword selection process was achieved.

3. THE PROPOSED APPROACH

To identify semantic keywords in the test images, keyword detectors/classifiers were trained. For objects with rigid appearance, such as police, helmet, and car, deformable part models (DPM) [8] were trained.
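The keyword selection procedure of Section 2.2 (the 5% frequency threshold per class, followed by removal of words that also pass the threshold on the benign side) can be sketched as follows. This is an assumed reconstruction for illustration; the final manual removal of phenomenon words such as riot is not shown.

```python
# Sketch of the two-step keyword selection: frequency thresholding per class,
# then a set difference to keep malicious-only keywords (Section 2.2).
from collections import Counter

def frequent_keywords(images_keywords, threshold=0.05):
    """images_keywords: one keyword tuple per image; returns words annotated
    in at least `threshold` of the images."""
    counts = Counter(kw for kws in images_keywords for kw in set(kws))
    n = len(images_keywords)
    return {kw for kw, c in counts.items() if c / n >= threshold}

def malicious_only(benign_imgs, malicious_imgs):
    """Malicious-side frequent keywords minus benign-side frequent keywords."""
    return frequent_keywords(malicious_imgs) - frequent_keywords(benign_imgs)

# Toy example with four images per class:
benign = [("crowd", "street")] * 4
malicious = [("crowd", "police", "smoke")] * 4
selected = malicious_only(benign, malicious)
```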
For fire and smoke, which are objects with non-rigid appearance, convolutional neural network (CNN) classifiers fine-tuned on the AlexNet architecture [9] were used. Since an object detector outputs multiple detections per image, we select the detection with the maximum score and use that score to represent the confidence of the object's presence in the image. We also built a CNN classifier that outputs a confidence score for the maliciousness of an image. Multiple late fusion approaches were utilized to combine the outputs of all keyword detectors/classifiers and the malicious image classifier.

3.1. Learning Keyword Detectors

The DPM detectors used to identify police and helmet were trained on 400 annotated images, consisting entirely of auxiliary images from Google Images. For the car detector, we used the DPM trained on the PASCAL VOC 2007 dataset [10].

3.2. Learning Malicious Event/Keyword Classifiers

First, a fine-tuned AlexNet deep convolutional neural network (DCNN) was trained to classify images as benign or malicious. The training set comprises 905 images randomly selected from the malicious crowd dataset. Fine-tuning was conducted on all eight layers of AlexNet, with a learning rate of 20 for the eighth layer and 2 for all others. The last layer was replaced to produce a binary output in contrast to the 1000-class output of AlexNet.

The fire and smoke DCNN-based classifiers were trained in a similar way. Each of these models was trained on 300 images, comprising images from our dataset and auxiliary images gathered from Google Images. We used separate networks for the two keywords, instead of one network with multiple labels, because both keywords may appear in the same training image.

3.3. Late Fusion

Late fusion was performed on the outputs of six streams: the malicious crowd image classifier, the three detectors for police, helmet, and car, and the two classifiers for fire and smoke. The late fusion is intended to enhance the baseline classifier, with the expectation that additional object information will increase classification accuracy. To test which fusion method is most effective, the streams were combined using various fusion methods: Linear Discriminant Analysis (LD) [11], Logistic Regression (LR) [12], Support Vector Machines (SVM) [13], k-Nearest Neighbor classifiers (kNN) [14], subspace-based Ensemble Classifiers (EC) [15], and Dynamic Belief Fusion (DBF) [16]. For SVM, we used two different kernels: a linear kernel (SVM-lin) and an RBF kernel (SVM-rbf). For kNN, we used 100 nearest neighbors under Euclidean distance. As the EC, we used a subspace ensemble classifier with a set of 30 weak models.

4. EXPERIMENTS

4.1. Dataset Partition and Evaluation Protocol

The malicious crowd dataset consists of 1133 images, 576 of which are labeled as malicious crowd images and the rest as benign. The same 905-image training set mentioned in Section 3.2 is used to train the fusion approaches. The remaining 228 images are used as the test set.
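To make the fusion step concrete, here is a toy pure-Python version of one of the listed methods, kNN fusion over the six stream scores. This is only an illustrative sketch under assumed inputs (the paper reports k = 100 with Euclidean distance over the real training set; a tiny k and hand-made score vectors are used here).

```python
# kNN late fusion sketch: each image is a 6-D vector of stream scores
# [malicious classifier, police, helmet, car, fire, smoke]; the fused score is
# the fraction of the k nearest training vectors labeled malicious (1).
import math

def knn_fusion_score(train_vecs, train_labels, test_vec, k=3):
    dists = sorted(
        (math.dist(v, test_vec), y) for v, y in zip(train_vecs, train_labels)
    )
    nearest = [y for _, y in dists[:k]]
    return sum(nearest) / k

# Toy example: two malicious-looking and two benign-looking training vectors.
train = [[0.9, 0.8, 0.7, 0.5, 0.9, 0.8],
         [0.8, 0.9, 0.6, 0.4, 0.7, 0.9],
         [0.1, 0.0, 0.1, 0.5, 0.0, 0.1],
         [0.2, 0.1, 0.0, 0.6, 0.1, 0.0]]
labels = [1, 1, 0, 0]
score = knn_fusion_score(train, labels, [0.85, 0.8, 0.6, 0.5, 0.8, 0.8], k=3)
```

The other fusion methods (LD, LR, SVM, EC, DBF) consume the same 6-D score vectors; only the decision rule changes.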
Fig. 4. Outputs of the malicious crowd image classification and the keyword detectors/classifiers: the first and second rows show the four examples with the largest fusion scores for the malicious event from the malicious and benign crowd images, respectively. Bounding boxes in red, green, and blue indicate detections of the police, helmet, and car detectors, respectively. The late fusion score is obtained by EC (a subspace ensemble classifier).

Table 2. Malicious crowd image classification accuracy measured by AP. The baseline is the malicious crowd image classifier; police through car are individual keyword detectors/classifiers; SVM-rbf through EC are late fusion methods.

method     AP      Gain
baseline   .722    ·
police     .586    ·
fire       .563    ·
smoke      .689    ·
helmet     .532    ·
car        .491    ·
SVM-rbf    .742    +.020
DBF        .757    +.035
SVM-lin    .758    +.036
kNN        .758    +.036
LD         .760    +.038
LR         .763    +.041
EC         .771    +.049

Average precision (AP) is used as the evaluation metric in our experiments.

4.2. Results

Table 2 shows the malicious crowd image classification accuracy in AP for the baseline malicious crowd image classifier, the keyword detectors/classifiers, and the various late fusion approaches. Note that, for each keyword detector/classifier, classification accuracy was calculated for recognizing the malicious image rather than the associated keyword.
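The AP metric used throughout can be computed from a ranked score list in its standard form, sketched below. The paper does not specify which AP variant (e.g. interpolated or not) was used, so this is the plain ranking-based definition.

```python
# Standard (non-interpolated) average precision: precision averaged at each
# positive hit when images are ranked by descending maliciousness score.

def average_precision(scores, labels):
    """scores: per-image maliciousness scores; labels: 1 = malicious."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / max(1, sum(labels))

ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```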
For example, if a test image is malicious but contains no police, and the police detector accordingly detects no police, the result is still counted as a false negative. The car detector does not provide competitive accuracy because, as shown in Table 1, car is not strongly associated with the malicious crowd. The other keyword detectors also do not provide better classification accuracy than the baseline malicious crowd image classification, because these semantic keywords (objects) appear in only a small portion of the dataset. However, integrating the baseline with the outputs of these keyword classifiers/detectors enhances classification accuracy by up to approximately 7%. The best performer is EC, the subspace-based ensemble classifier, achieving a fusion gain of .049 in AP. All fusion approaches improve classification accuracy over the baseline, which supports the benefit of jointly exploiting semantic keywords and the associated detectors and classifiers. Figure 4 shows several images from both the malicious and benign classes with high maliciousness scores.

5. CONCLUSION

We addressed a new image classification problem in which certain classes are expressed by similar visual features but must be distinguished from each other semantically. To demonstrate this, we constructed a novel malicious crowd image dataset consisting of two classes (benign and malicious) that look similar but contain semantically different events. To better classify images with these characteristics, we selected representative keywords for malicious crowd images, which are then incorporated with a conventional image classifier using a multi-stream late fusion architecture. As Table 2 shows, the hypothesized approach leads to considerable performance improvements over the conventional baseline classifier when used in practice.
6. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems 25, 2012.

[2] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic, "Is object localization for free? – weakly-supervised learning with convolutional neural networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[4] Archith J. Bency, Heesung Kwon, Hyungtae Lee, S. Karthikeyan, and B. S. Manjunath, "Weakly supervised localization using deep feature maps," European Conference on Computer Vision, 2016.

[5] Li-Jia Li and Li Fei-Fei, "What, where and who? Classifying events by scene and object recognition," IEEE International Conference on Computer Vision, 2007.

[6] Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, J. K. Aggarwal, Hyungtae Lee, Larry Davis, et al., "A large-scale benchmark dataset for event recognition in surveillance video," IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[7] George Awad, Jonathan Fiscus, Martial Michel, David Joy, Wessel Kraaij, Alan F. Smeaton, Georges Quénot, Maria Eskevich, Robin Aly, and Roeland Ordelman, "TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking," in Proceedings of TRECVID 2016, NIST, USA, 2016.

[8] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[9] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[10] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[11] Ronald Aylmer Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, pp. 179–188, 1936.

[12] David A. Freedman, Statistical Models: Theory and Practice, Cambridge University Press, 2009, p. 128.

[13] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[14] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.

[15] Tin Kam Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

[16] Hyungtae Lee, Heesung Kwon, Ryan M. Robinson, William D. Nothwang, and Amar M. Marathe, "Dynamic belief fusion for object detection," IEEE Winter Conference on Applications of Computer Vision, 2016.