This document summarizes a research article that proposes using a convolutional neural network (CNN) to detect malware in PDF files. The researchers collected benign and malicious PDF files, extracted byte sequences, and manually labeled the data. They designed a CNN model to interpret patterns in the byte sequence data and predict whether a file contains malware. Their experimental results showed that the proposed CNN model outperformed other machine learning models in malware detection of PDF files.
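The detection pipeline the summary describes (byte sequences in, convolution over them, a malware verdict out) can be sketched in miniature. Everything below is an illustrative stand-in, not the paper's trained model: the single hand-picked kernel, the normalization, and the threshold are all assumptions made for the sketch.

```python
def conv1d(xs, kernel):
    """Valid 1-D convolution (strictly, cross-correlation) over a sequence."""
    k = len(kernel)
    return [sum(xs[i + j] * kernel[j] for j in range(k))
            for i in range(len(xs) - k + 1)]

def detect(byte_seq, kernel, threshold):
    """Flag a file as suspicious if any convolved activation beats the threshold.
    A real CNN learns many kernels and a classifier head; this uses one fixed
    kernel purely to show the byte-sequence -> feature -> decision shape."""
    feats = conv1d([b / 255.0 for b in byte_seq], kernel)
    return max(feats) > threshold
```

A run of high bytes (here standing in for a learned byte pattern) trips the detector, while a flat file does not.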
A study of index poisoning in peer-to-peer (IJCI Journal)
P2P file sharing systems are the most popular form of file sharing to date. Their peer-to-peer architecture attains faster file transfers; however, peer anonymity and the lack of authentication have made them a gold mine for malicious attacks. One of the leading sources of disruption in P2P file sharing systems is the index poisoning attack, which corrupts the indexes used to reference files available for download with false data. To protect users from these attacks it is important to find solutions that eliminate or mitigate their effects. This paper analyzes index poisoning attacks, their uses, and the solutions proposed to defend against them.
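The mechanics of the attack are easy to see with a toy index. The structure below (a title-to-advertisements map, and an attacker flooding it with bogus records) is a deliberately simplified model of the indexes the abstract describes, not any real P2P protocol.

```python
# Toy P2P index: title -> list of (peer, content_hash) advertisements.
index = {}

def advertise(title, peer, content_hash):
    """A peer announces that it can serve `title` with the given hash."""
    index.setdefault(title, []).append((peer, content_hash))

def poison(title, attacker_peers):
    """Index poisoning: flood the entry for a title with bogus records that
    point at hashes no peer can actually serve, drowning out the real copy."""
    for i, peer in enumerate(attacker_peers):
        advertise(title, peer, f"bogus-{i}")

def lookup(title):
    return index.get(title, [])
```

After one legitimate advertisement and one poisoning pass, most lookup results are unserviceable, which is exactly the disruption the surveyed defenses try to mitigate.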
Automated hierarchical classification of scanned documents using convolutiona... (IJECE IAES)
This research proposes automated hierarchical classification of scanned documents whose content combines unstructured text with special patterns (specific, short strings), using a convolutional neural network (CNN) and a regular expression method (REM). The data are digital correspondence documents in PDF image format from Pusat Data Teknologi dan Informasi (Technology and Information Data Center). The document hierarchy covers type of letter, type of manuscript letter, origin of letter, and subject of letter. The method consists of preprocessing, classification, and storage to a database. Preprocessing covers text extraction with Tesseract optical character recognition (OCR) and formation of word document vectors with Word2Vec. Hierarchical classification uses a CNN to classify 5 types of letters and regular expressions to classify 4 types of manuscript letter, 15 origins of letter, and 25 subjects of letter. The classified documents are stored in a Hive database on a Hadoop big data architecture. The dataset comprises 5,200 documents: 4,000 for training, 1,000 for testing, and 200 for classification prediction. In the trial on the 200 new documents, 188 were classified correctly and 12 incorrectly, for an accuracy of 94%. Content-based search over the classified scanned documents can be developed next.
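The regular-expression half of that pipeline is straightforward to sketch. The pattern table below is entirely hypothetical: the paper's actual patterns and class labels are not given in the abstract, so these Indonesian letter-type labels are invented for illustration.

```python
import re

# Hypothetical patterns for manuscript-letter types; the paper's real
# regexes and its 4 actual classes are not public, so these are stand-ins.
MANUSCRIPT_PATTERNS = {
    "undangan": re.compile(r"\bundangan\b", re.IGNORECASE),
    "nota dinas": re.compile(r"\bnota\s+dinas\b", re.IGNORECASE),
    "surat tugas": re.compile(r"\bsurat\s+tugas\b", re.IGNORECASE),
}

def classify_manuscript(ocr_text):
    """Assign the first manuscript type whose pattern matches the OCR text."""
    for label, pattern in MANUSCRIPT_PATTERNS.items():
        if pattern.search(ocr_text):
            return label
    return "unknown"
```

Because the target strings are specific and short, regex matching on the OCR output is a reasonable fit for these sub-levels of the hierarchy, while the CNN handles the fuzzier top-level letter types.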
Clustering of Deep Web Pages: A Comparative Study (IJCSIT)
The internet holds a massive amount of information, stored in the form of billions of webpages. The information that can be retrieved by search engines is huge, and it constitutes the ‘surface web’. But the remaining information, which is not indexed by search engines – the ‘deep web’ – is much bigger than the surface web and remains largely unexploited.
Several machine learning techniques have been employed to access deep web content. Within machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A ‘topic’ is a cluster of words that frequently occur together; topic models can connect words with similar meanings and distinguish between words with multiple meanings. In this paper, we cluster deep web databases using several methods and then perform a comparative study. In the first method, we apply Latent Semantic Analysis (LSA) over the dataset. In the second, we use a generative probabilistic model, Latent Dirichlet Allocation (LDA), to model content representative of deep web databases. Both techniques are applied after preprocessing the set of web pages to extract page contents and form contents. Further, we propose a modified version of LDA for the dataset. Experimental results show that the proposed method outperforms the existing clustering methods.
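The LSA step of such a comparison can be shown compactly: a truncated SVD of a term-document matrix gives each document a low-dimensional embedding in which topically similar databases land close together. The tiny matrix below is fabricated for illustration; it is not the paper's dataset.

```python
import numpy as np

def lsa_embeddings(term_doc, k=2):
    """Latent Semantic Analysis: project documents into a k-dimensional
    latent space via a truncated SVD of the term-document count matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim row per document

# Toy matrix: rows are terms, columns are documents. Docs 0 and 1 share
# vocabulary; doc 2 uses disjoint terms, so it should embed far away.
M = np.array([[1., 1., 0.],
              [1., 2., 0.],
              [0., 0., 1.],
              [0., 0., 2.]])
```

Cosine similarity in the latent space then drives the clustering; documents 0 and 1 come out nearly collinear while document 2 is nearly orthogonal to them.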
Text pre-processing of multilingual for sentiment analysis based on social ne... (IJECE IAES)
Sentiment analysis (SA) is an enduring research area, especially in the field of text analysis, and text pre-processing is essential to performing SA accurately. This paper presents a text pre-processing model for SA of Twitter data using natural language processing techniques. The basic machine learning phases are text collection, text cleaning, pre-processing, feature extraction, and categorization of the data according to SA techniques. Keeping the focus on Twitter data, the data are extracted in a domain-specific manner. The cleaning phase handles noisy data, missing data, punctuation, tags, and emoticons. For pre-processing, tokenization is performed, followed by stop word removal (SWR). The article provides insight into the techniques used for text pre-processing and the impact of their presence on the dataset: after applying pre-processing, the accuracy of the classification techniques improved and dimensionality was reduced. The resulting corpus can be utilized for market analysis, customer behaviour, polling analysis, and brand monitoring, and the pre-processing pipeline can serve as a baseline for predictive analysis, machine learning, and deep learning algorithms, extended according to the problem definition.
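The cleaning-then-tokenization-then-SWR sequence described above can be sketched as a few regex passes. The stop-word list here is deliberately tiny; a real pipeline would use a full list, and the exact cleaning rules in the paper may differ.

```python
import re
import string

STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of"}  # abbreviated list

def preprocess(tweet):
    """Clean a tweet, tokenize it, and remove stop words."""
    tweet = re.sub(r"https?://\S+", "", tweet)          # strip URLs (noise)
    tweet = re.sub(r"[@#]\w+", "", tweet)               # strip tags/mentions
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    tokens = tweet.lower().split()                      # tokenization
    return [t for t in tokens if t not in STOP_WORDS]   # stop word removal
```

The dimensionality reduction the abstract reports follows directly: URLs, handles, punctuation, and stop words all vanish before feature extraction ever sees the text.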
In recent years the growth of digital data has increased dramatically, and knowledge discovery and data mining have attracted immense attention given the need to turn such data into useful information and knowledge. Keyword extraction is an essential task in natural language processing (NLP) that maps documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents, with the SemEval-2010 dataset as the main input. After applying pre-processing operations to the dataset, words are represented as vectors with the Word2Vec technique. The method is based on word similarity between candidate keywords, comparing the collected keywords for each label against one sample from the same label. An appropriate threshold is determined, and the similarity percentages that exceed it are passed to the Decision Tree, which assigns an appropriate classification to the text document.
Several similarity measurements were used in the classification process, and the efficiency and accuracy of the algorithm were measured using precision, recall, and F-score. The results indicate that vector representations of keywords are an effective way to identify the most similar words, increasing the chance of recognizing the correct classification of a document. With Word2Vec CBOW, the F-score was 64% using the Gini method and the WordNet lemmatizer; with Word2Vec skip-gram (SG), the F-score reached 82% using the Gini index and English Porter stemming, the highest ratio across all our experiments.
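The similarity-then-threshold step can be made concrete. The 2-dimensional vectors below are fabricated placeholders for real Word2Vec embeddings, and the 0.5 threshold is an arbitrary example value, not the one the paper determined.

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def label_score(candidate_vecs, label_keyword_vecs, threshold=0.5):
    """Fraction of candidate keywords whose similarity to some keyword of a
    label clears the threshold; these percentages feed the Decision Tree."""
    hits = sum(1 for c in candidate_vecs
               if any(cosine(c, k) >= threshold for k in label_keyword_vecs))
    return hits / len(candidate_vecs)
```

One label's score per document, computed this way for every label, is the feature vector the tree then classifies.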
Mining academic social networks is becoming increasingly necessary as the amount of data grows, and it is a favorite research topic for many researchers. Data mining techniques are used to mine academic social networks. In this paper, we present an efficient frequent itemset mining technique for academic social networks. The proposed framework first processes the research documents, then applies enhanced frequent itemset mining to find the strength of the relationships between researchers. The proposed method is faster than older algorithms and requires less main memory for computation.
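As a baseline for what "frequent itemset mining over co-occurring researchers" means, here is a naive miner (it counts every candidate subset directly, with none of the paper's efficiency enhancements). Treating each paper's author set as a transaction is an interpretive assumption on my part.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Naive frequent-itemset mining: count every subset up to max_size and
    keep those meeting min_support. transactions: iterable of sets, e.g. the
    set of researchers co-authoring one paper."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    return {items: n for items, n in counts.items() if n >= min_support}
```

The support of a researcher pair is then a direct measure of relationship strength: how often the two appear together across the document collection.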
This paper addresses the extraction of important words from conversations, with the objective of using these keywords to retrieve, for each short audio fragment, a small number of potentially relevant documents that can be recommended to participants just-in-time. However, even a short audio fragment contains a variety of words potentially related to several topics; moreover, an automatic speech recognition (ASR) system introduces errors into the output. It is therefore hard to infer precisely the information needs of the conversation participants. We first propose an algorithm to extract keywords from the output of an ASR system (or a manual transcript, for testing) that covers the potential diversity of topics and reduces ASR noise. We then use a technique that builds multiple implicit queries from the selected keywords, which in turn produce lists of relevant documents. The scores show that our proposal improves over previous systems that consider only word frequency or topic similarity, and represents a promising solution for a document recommender system to be used in conversations.
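A minimal version of the keyword-to-implicit-query step might look as follows. Frequency ranking is used here as a crude stand-in for the paper's diversity-aware, noise-robust extraction; the stop-word list is abbreviated.

```python
from collections import Counter
from itertools import combinations

STOP = {"the", "a", "to", "and", "of", "in", "we", "is"}  # abbreviated list

def extract_keywords(transcript, k=3):
    """Rank content words of a transcript by frequency (a crude stand-in for
    the paper's topically diverse keyword extraction)."""
    words = [w for w in transcript.lower().split()
             if w.isalpha() and w not in STOP]
    return [w for w, _ in Counter(words).most_common(k)]

def implicit_queries(keywords, size=2):
    """Form several small implicit queries from the selected keywords, so
    that different topic mixtures each get a chance to retrieve documents."""
    return [" ".join(c) for c in combinations(keywords, size)]
```

Each implicit query is submitted separately, and the per-query result lists are merged into the short recommendation list shown to participants.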
We are developing a web-based system to detect plagiarism in written Arabic documents. This paper describes the proposed framework, which comprises two main components, one global and one local. The global component is heuristics-based: from a given, potentially plagiarized document it constructs a set of representative queries using different best-performing heuristics, then submits these queries to Google via Google's search API to retrieve candidate source documents from the Web. The local component carries out detailed similarity computations, combining different similarity computation techniques to check which parts of the given document are plagiarized and from which of the retrieved source documents. Since this is an ongoing research project, the overall system has not yet been evaluated.
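The two components can be sketched side by side. The query heuristic below (longest words per sentence) is one invented stand-in for the paper's "best performing heuristics", and word n-gram Jaccard overlap stands in for its combined local similarity techniques.

```python
def make_queries(document, words_per_query=4):
    """Global component sketch: one query per sentence, built from its longest
    words (a hypothetical heuristic; the paper's heuristics are not specified
    in the abstract)."""
    queries = []
    for sentence in document.split("."):
        words = sorted(sentence.split(), key=len, reverse=True)[:words_per_query]
        if words:
            queries.append(" ".join(words))
    return queries

def ngram_overlap(a, b, n=3):
    """Local component sketch: word n-gram Jaccard similarity between a part
    of the suspicious document and a candidate source."""
    grams = lambda s: {tuple(s.split()[i:i + n])
                       for i in range(len(s.split()) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(len(ga | gb), 1)
```

High overlap between a passage and a retrieved page is then evidence that the passage was plagiarized from that source.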
A Novel Approach for Keyword Extraction in Learning Objects Using Text Mining (IJSRD)
Keyword extraction and concept finding in learning objects are important tasks in today's eLearning environment. Keywords are the subset of words that carries the useful information about the content of a document, and keyword extraction is the process of obtaining these important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature selection process together with the WordNet dictionary. WordNet is a lexical database of English used to measure similarity among the candidate words; the words with the highest similarity are taken as keywords.
Keyword extraction and clustering for document recommendation in conversations (LeMeniz Infotech)
Electronic mail is the most widely used and convenient method of transferring messages electronically from one person to another, to and from any part of the world. Its main features, speed, dependability, well-equipped storage options, and a large number of added services, make it highly popular among people from all sectors of business and society. But this popularity has a negative side too: email is the preferred medium for a large number of attacks over the internet, among the most common of which is spam. Some existing methods detect spam-related mail but suffer from high false-positive rates. Filters such as checksum-based filters, Bayesian filters, and machine-learning-based and memory-based filters are commonly used to recognize spam, but as spammers constantly find ways to evade existing filters, new filters need to be developed. This paper proposes a resourceful spam mail filtering method using a user-profile-based ontology. Ontologies permit machine-understandable semantics of data, and exchanging this information enables more efficient spam filtering; it is therefore essential to build an ontology and a framework for capable email filtering. Using an ontology designed specifically to filter spam, a mass of useless bulk email can be filtered out of the system. We propose a user-profile-based spam filter that classifies email based on the likelihood that the user-profile terms within it have appeared in spam or in valid email.
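The "likelihood that profile terms appeared in spam or valid email" idea resembles a log-likelihood-ratio score, sketched below. The profile terms and the per-term probabilities are fabricated for illustration; the paper's ontology structure is not reproduced here.

```python
import math

def spam_score(message, profile_terms, spam_freq, ham_freq):
    """Sum of log-likelihood ratios for user-profile terms found in the
    message. spam_freq/ham_freq map a term to P(term | spam) and
    P(term | ham), estimated from previously seen mail; 0.01 is a smoothing
    floor for unseen terms. Positive score => leans spam."""
    score = 0.0
    for term in message.lower().split():
        if term in profile_terms:
            score += math.log(spam_freq.get(term, 0.01) /
                              ham_freq.get(term, 0.01))
    return score
```

Terms outside the user's profile contribute nothing, which is the personalization the abstract emphasizes: the same message can score differently for different users.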
Privacy preservation is an essential aspect of the cloud. In privacy-preserving ranked keyword search, the data owner outsources documents in encrypted form, so a reliable ciphertext search technique is essential for privacy, and designing one that works over encrypted documents is a critical task. In this paper, a hierarchical clustering method is designed for semantic search and fast ciphertext search within a big data environment. In hierarchical clustering, documents are grouped by maximum relevance score and clusters are divided into sub-clusters.
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework... (IJCER Online)
Performance evaluation of decision tree classification algorithms using fraud... (BEEI Journal)
Text mining is one way of extracting knowledge and uncovering hidden relationships in data using artificial intelligence methods. Although taking advantage of different techniques has been highlighted in previous research, the lack of literature focusing on cybercrime implies that data mining is underused in facilitating cybercrime investigations in the Philippines. This study therefore classifies computer fraud and online scam data from police incident reports and narratives of scam victims, as a continuation of a prior study. The dataset consists mainly of unstructured data comprising 49,822 mostly Filipino words. Five decision tree algorithms, namely J48, Hoeffding Tree, Decision Stump, REPTree, and Random Forest, were employed and compared in terms of performance and prediction accuracy. The results show that J48 achieves the highest accuracy and the lowest error rate among the classifiers. The results were validated by police investigators, who likewise preferred J48 as a potential tool for cybercrime investigations. This indicates the importance of text mining in the country's cybercrime investigation domain. Future work can use different and more inclusive cybercrime datasets and other classification techniques in Weka or any other data mining tool.
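At the heart of trees like J48 (Weka's C4.5 implementation) is an information-gain split criterion, which is small enough to show directly. The toy scam/non-scam samples below are fabricated; real inputs would be the tokenized incident reports.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list (0 for an empty or pure list)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, word):
    """samples: list of (set_of_words, label). Information gained by splitting
    on whether a report contains `word`, the criterion C4.5-style trees use
    to pick each split (C4.5 proper normalizes this into a gain ratio)."""
    labels = [y for _, y in samples]
    with_w = [y for ws, y in samples if word in ws]
    without = [y for ws, y in samples if word not in ws]
    split = (len(with_w) / len(samples)) * entropy(with_w) + \
            (len(without) / len(samples)) * entropy(without)
    return entropy(labels) - split
```

A word that cleanly separates scam reports from the rest yields the maximum gain and is chosen as the split feature.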
Malicious PDF document detection based on feature extraction and entropy (IJSPTM)
In this paper we present a machine learning based approach for detecting malicious PDF documents. We identify various features in PDF documents that malware authors use to construct a malicious file, and from this feature set we build models used to detect malicious PDF documents. Based on these feature sets, the detection rate is higher than that of approaches that depend on analyzing the JavaScript embedded in the PDF document.
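A sketch of the kind of features such a model consumes: counts of PDF name objects often abused by malware authors, plus the byte entropy the title mentions. The keyword list is a common illustrative choice, not necessarily the paper's feature set, and naive substring counting has quirks (e.g. `/JS` also matches inside `/JavaScript`).

```python
import math
from collections import Counter

# Name objects frequently abused in malicious PDFs (illustrative selection).
SUSPICIOUS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch", b"/AA"]

def pdf_features(raw):
    """Feature vector for a raw PDF byte string: per-keyword counts plus
    overall byte entropy (high entropy can signal packed/encrypted payloads).
    Note: raw.count is substring-based, so /JS also counts /JavaScript hits."""
    feats = {name.decode(): raw.count(name) for name in SUSPICIOUS}
    n = len(raw)
    feats["entropy"] = (-sum((c / n) * math.log2(c / n)
                             for c in Counter(raw).values()) if n else 0.0)
    return feats
```

These static features sidestep JavaScript emulation entirely, which is the contrast with embedded-JavaScript analysis the abstract draws.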
FEATURE EXTRACTION AND FEATURE SELECTION: REDUCING DATA COMPLEXITY WITH APACH... (IJNSA Journal)
Feature extraction and feature selection are the first pre-processing tasks applied to input logs in order to detect cyber security threats and attacks using machine learning. When analyzing heterogeneous data derived from different sources, these tasks are time-consuming and difficult to manage efficiently. In this paper, we present an approach for handling feature extraction and feature selection for security analytics of heterogeneous data derived from different network sensors. The approach is implemented in Apache Spark, using its Python API, pyspark.
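The two tasks themselves are simple transformations; what Spark adds is running them in parallel over large log volumes. Below is a plain-Python sketch of the per-record logic (the log format is hypothetical); in the paper's setting these functions would be mapped over a pyspark RDD or DataFrame instead of a list.

```python
def extract_features(log_line):
    """Feature extraction: parse one (hypothetical) comma-separated network
    sensor record into named features."""
    src_ip, dst_ip, dst_port, n_bytes = log_line.strip().split(",")
    return {"src_ip": src_ip, "dst_ip": dst_ip,
            "dst_port": int(dst_port), "bytes": int(n_bytes)}

def select_features(rows, keep):
    """Feature selection: project every record onto the chosen columns,
    discarding the rest before the records reach the learning stage."""
    return [{k: r[k] for k in keep} for r in rows]
```

Heterogeneity is then handled by writing one `extract_features` per sensor format, all emitting the same feature schema.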
Comparative Study on Graph-based Information Retrieval: the Case of XML Documents (IJAEMS Journal)
Processing massive amounts of data has become indispensable, especially with the proliferation of big data. The volume of information available nowadays makes it difficult for users to find relevant information in vast collections of documents, so exploiting such collections necessitates automated technologies that enable appropriate and effective retrieval. In this paper, we examine the state of the art of information retrieval (IR) in XML documents and discuss works that have used graphs to represent documents in the context of IR; the relationships between the components of a graph are the center of our attention.
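The graph representation at issue is easy to materialize for XML: elements become nodes and the parent-child structure becomes edges. This minimal sketch uses only tag names as node labels; graph-based IR systems typically enrich nodes with text content and add further edge types.

```python
import xml.etree.ElementTree as ET

def xml_to_graph(xml_text):
    """Represent an XML document as an edge list: nodes are element tags and
    each edge is a parent->child structural relationship."""
    root = ET.fromstring(xml_text)
    edges = []
    def walk(elem):
        for child in elem:
            edges.append((elem.tag, child.tag))
            walk(child)
    walk(root)
    return edges
```

Retrieval can then compare documents by their structural relationships (shared edges or subgraphs) rather than by flat term overlap alone.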
Mining academic social network is becoming increasingly necessary with the increasing amount of data. It
is a favorite topic of research for many researchers. The data mining techniques are used for the mining of
academic social networks. In this paper, we are presenting an efficient frequent item set mining technique
for social academic network. The proposed framework first processes the research documents and then the
enhanced frequent item set mining is applied to find the strength of relationship between the researchers.
The proposed method will be fast in comparison to older algorithms. Also it will takes less main memory
space for computation purpose.
This paper is addressed towards extraction of important words from conversations, with the
objective of utilizing these watchwords to recover, for every short audio fragment, a little number of
conceivably relatable reports, which can be prescribed to members, just-in-time. In any case, even a short
audio fragment contains a mixed bag of words, which are conceivably identified with a few topics; also,
utilizing automatic speech recognition (ASR) framework slips errors in the output. Along these lines, it is
hard to surmise correctly the data needs of the discussion members. We first propose a calculation to
remove decisive words from the yield of an ASR framework (or a manual transcript for testing) to
coordinate the potentially differing qualities of subjects and decrease ASR commotion. At that point, we
make use of a technique that to make many implicit queries from the selected keywords which will in
return produce list of relevant documents. The scores demonstrate that our proposition moves forward over
past systems that consider just word recurrence or theme closeness, and speaks to a promising answer for a
report recommender framework to be utilized as a part of discussions.
2016 BE Final year Projects in chennai - 1 Crore Projects 1crore projects
IEEE PROJECTS 2016 - 2017
1 crore projects is a leading Guide for ieee Projects and real time projects Works Provider.
It has been provided Lot of Guidance for Thousands of Students & made them more beneficial in all Technology Training.
Project Domain list 2016
1. IEEE based on datamining and knowledge engineering,
2. IEEE based on mobile computing,
3. IEEE based on networking,
4. IEEE based on Image processing,
5. IEEE based on Multimedia,
6. IEEE based on Network security,
7. IEEE based on parallel and distributed systems
Project Domain list 2016
1. IEEE based on datamining and knowledge engineering
2. IEEE based on mobile computing
3. IEEE based on networking
4. IEEE based on Image processing
5. IEEE based on Multimedia
6. IEEE based on Network security
7. IEEE based on parallel and distributed systems
ECE IEEE Projects 2016
1. Matlab project
2. Ns2 project
3. Embedded project
4. Robotics project
5. IOT Projects
Eligibility
Final Year students of
1. BSc (C.S)
2. BCA/B.E(C.S)
3. B.Tech IT
4. BE (C.S)
5. MSc (C.S)
6. MSc (IT)
7. MCA
8. MS (IT)
9. ME(ALL)
10. BE(ECE)(EEE)(E&I)
TECHNOLOGY USED AND FOR TRAINING IN
1. DOT NET
2. C sharp
3. ASP
4. VB
5. SQL SERVER
6. JAVA
7. J2EE
8. STRINGS
9. ORACLE
10. VB dotNET
11. EMBEDDED
12. MAT LAB
13. LAB VIEW
14. Multi Sim
CONTACT US:-
1 CRORE PROJECTS
Door No: 66 ,Ground Floor,
No. 172, Raahat Plaza, (Shopping Mall) ,Arcot Road, Vadapalani, Chennai,
Tamin Nadu, INDIA - 600 026
Email id: 1croreprojects@gmail.com
website:1croreprojects.com
Phone : +91 97518 00789 / +91 7708150152
We are developing a web-based plagiarism detection system to detect plagiarism in written Arabic documents. This paper describes the proposed framework of our plagiarism detection system. The proposed plagiarism detection framework comprises of two main components, one global and the other local. The global component is heuristics-based, in which a potentially plagiarized given document is used to construct a set of representative queries by using different best performing heuristics. These queries are then submitted to Google via Google's search API to retrieve candidate source documents from the Web. The local component carries out detailed
similarity computations by combining different similarity computation techniques to check which parts of the given document are plagiarised and from which source documents retrieved from the Web. Since this is an ongoing research project, the quality of overall system is not evaluated yet.
A Novel Approach for Keyword extraction in learning objects using text miningIJSRD
Keyword extraction, concept finding are in learning objects is very important subject in today’s eLearning environment. Keywords are subset of words that contains the useful information about the content of the document. Keyword extraction is a process that is used to get the important keywords from documents. In this proposed System Decision tree algorithm is used for feature selection process using wordnet dictionary. WordNet is a lexical database of English which is used to find similarity from the candidate words. The words having highest similarity are taken as keywords.
Keyword extraction and clustering for document recommendation in conversations.LeMeniz Infotech
Keyword extraction and clustering for document recommendation in conversations.
Do Your Projects With Technology Experts
To Get this projects Call : 9566355386 / 99625 88976
Visit : www.lemenizinfotech.com / www.ieeemaster.com
Mail : projects@lemenizinfotech.com
Electronic mail is the most widely used method of transferring messages electronically from one person to another, to and from any part of the world. Its speed, dependability, well-equipped storage options, and large number of added services make it highly popular among people from all sectors of business and society. This popularity has a negative side, however: e-mail is a preferred medium for a large number of attacks over the Internet, among the most common of which is spam. Existing methods can detect spam-related mail, but they suffer from high false-positive rates. Filters such as checksum-based, Bayesian, machine-learning-based, and memory-based filters are commonly used to recognize spam; yet, as spammers constantly find ways to evade existing filters, new filters need to be developed. This paper proposes an efficient spam-mail filtering method based on a user-profile ontology. Ontologies permit machine-understandable semantics of data, and exchanging this information enables more efficient spam filtering; it is therefore essential to build an ontology and a framework for capable e-mail filtering. Using an ontology specifically designed to filter spam, a large amount of useless bulk e-mail can be filtered out of the system. We propose a user-profile-based spam filter that classifies e-mail based on the likelihood that the user-profile terms within it have previously appeared in spam or valid e-mail.
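The abstract does not specify the ontology machinery, but the core idea — scoring mail by how often its terms previously appeared in spam versus valid mail — can be sketched with a simple frequency profile. All messages below are hypothetical:

```python
from collections import Counter

def build_profile(spam_msgs, ham_msgs):
    """Count how often each token appears in known spam vs. legitimate mail."""
    spam = Counter(tok for m in spam_msgs for tok in m.lower().split())
    ham = Counter(tok for m in ham_msgs for tok in m.lower().split())
    return spam, ham

def spam_likelihood(message, spam, ham):
    """Laplace-smoothed ratio of spam to ham token frequencies (>1 leans spam)."""
    score = 1.0
    for tok in message.lower().split():
        score *= (spam[tok] + 1) / (ham[tok] + 1)
    return score

spam, ham = build_profile(
    ["win free prize now", "free money now"],
    ["meeting notes attached", "lunch at noon"],
)
print(spam_likelihood("free prize", spam, ham))       # → 6.0
print(spam_likelihood("meeting at noon", spam, ham))  # → 0.125
```

A real ontology-based filter would reason over concepts rather than raw tokens, but the classification decision ultimately rests on the same kind of likelihood comparison.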
Privacy preservation is an essential aspect of the cloud. In privacy-preserving ranked keyword search, the data owner outsources documents in encrypted form, so a reliable ciphertext search technique is essential, and designing one that operates over encrypted documents is a critical task. In this paper, a hierarchical clustering method is designed for semantic search and fast ciphertext search in a big-data environment. In hierarchical clustering, documents are grouped by maximum relevance score and clusters are divided into sub-clusters.
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...ijceronline
Performance evaluation of decision tree classification algorithms using fraud...journalBEEI
Text mining is one way of extracting knowledge and finding hidden relationships in data using artificial intelligence methods. Taking advantage of different techniques has certainly been highlighted in previous research; however, the lack of literature focusing on cybercrimes implies a lack of utilization of data mining in facilitating cybercrime investigations in the Philippines. This study therefore classifies computer fraud or online scam data coming from police incident reports as well as narratives of scam victims, as a continuation of a prior study. The dataset consists mainly of unstructured data of 49,822 mainly Filipino words. Five decision tree algorithms, namely J48, Hoeffding Tree, Decision Stump, REPTree, and Random Forest, were employed and compared in terms of performance and prediction accuracy. The results show that J48 achieves the highest accuracy and the lowest error rate among the classifiers. The results were validated by police investigators, who likewise preferred J48 as a potential tool for cybercrime investigations. This indicates the importance of text mining in the field of cybercrime investigation in the country. Further work can be carried out using different and more inclusive cybercrime datasets and other classification techniques in Weka or any other data mining tool.
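The study used Weka's J48 (an implementation of C4.5). As a rough analogue, scikit-learn's CART-based decision tree can be trained on bag-of-words features extracted from report text. The toy reports and labels below are invented, not the study's dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for incident-report narratives.
reports = [
    "victim sent money to fake online seller",
    "account hacked after phishing email link",
    "paid for item that was never delivered",
    "password stolen through fake login page",
]
labels = ["scam", "phishing", "scam", "phishing"]

# Bag-of-words features feed a decision tree, as in the Weka pipeline.
vec = CountVectorizer().fit(reports)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(vec.transform(reports), labels)

print(clf.predict(vec.transform(["seller never delivered the item"]))[0])
```

C4.5 and CART differ in split criteria and pruning, so this is an approximation of the J48 workflow rather than a reimplementation.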
Malicious pdf document detection based on feature extraction and entropyijsptm
In this paper we present a machine-learning-based approach for the detection of malicious PDF documents. We identify various features in PDF documents that are used by malware authors to construct a malicious file, and based on this feature set we build models used to detect malicious PDF documents. With these feature sets, the detection rate is high compared to approaches that depend on analysis of the JavaScript embedded in the PDF document.
FEATURE EXTRACTION AND FEATURE SELECTION: REDUCING DATA COMPLEXITY WITH APACH...IJNSA Journal
Feature extraction and feature selection are the first tasks in pre-processing input logs in order to detect cyber security threats and attacks using machine learning. When analyzing heterogeneous data derived from different sources, these tasks prove time-consuming and difficult to manage efficiently. In this paper, we present an approach for handling feature extraction and feature selection for security analytics of heterogeneous data derived from different network sensors. The approach is implemented in Apache Spark, using its Python API, named pyspark.
Comparative Study on Graph-based Information Retrieval: the Case of XML DocumentIJAEMSJORNAL
The processing of massive amounts of data has become indispensable especially with the potential proliferation of big data. The volume of information available nowadays makes it difficult for the user to find relevant information in a vast collection of documents. As a result, the exploitation of vast document collections necessitates the implementation of automated technologies that enable appropriate and effective retrieval. In this paper, we will examine the state of the art of IR in XML documents. We will also discuss some works that have used graphs to represent documents in the context of IR. In the same vein, the relationships between the components of a graph are the center of our attention.
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH...ijcsit
Through the generalization of deep learning, the research community has addressed critical challenges in
the network security domain, like malware identification and anomaly detection. However, they have yet to
discuss deploying them on Internet of Things (IoT) devices for day-to-day operations. IoT devices are often
limited in memory and processing power, rendering the compute-intensive deep learning environment
unusable. This research proposes a way to overcome this barrier by bypassing feature engineering in the
deep learning pipeline and using raw packet data as input. We introduce a feature-engineering-less
machine learning (ML) process to perform malware detection on IoT devices. Our proposed model,
“Feature-engineering-less ML (FEL-ML),” is a lighter-weight detection algorithm that expends no extra
computations on “engineered” features. It effectively accelerates the low-powered IoT edge. It is trained
on unprocessed byte-streams of packets. Aside from providing better results, it is quicker than traditional
feature-based methods. FEL-ML facilitates resource-sensitive network traffic security with the added
benefit of eliminating the significant investment by subject matter experts in feature engineering.
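The "feature-engineering-less" input described above reduces preprocessing to padding and scaling raw packet bytes. A minimal sketch of that step (the header bytes and the 64-byte input length are made-up assumptions, not the paper's configuration):

```python
def bytes_to_tensor(packet: bytes, length: int = 64) -> list:
    """Truncate or zero-pad a raw packet, then scale each byte to [0, 1].
    No protocol fields are parsed -- this is the entire 'preprocessing'."""
    padded = packet[:length].ljust(length, b"\x00")
    return [b / 255.0 for b in padded]

# A hypothetical IPv4 header prefix, fed to the model as-is.
sample = bytes.fromhex("4500003c1c4640004006b1e6")
vec = bytes_to_tensor(sample)
print(len(vec))  # → 64
```

Because no engineered features are computed, this step costs only a copy and a scale — which is what makes the approach viable on low-powered IoT edge devices.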
‘CodeAliker’ - Plagiarism Detection on the Cloud acijjournal
Plagiarism is a burning problem that academics have faced at all levels of the educational system. With the advent of digital content, the challenge of ensuring the integrity of academic work has been amplified. This paper discusses a precise definition of plagiarized computer code, the various solutions available for detecting plagiarism, and the building of a cloud platform for plagiarism detection.
‘CodeAliker’, the application thus developed, automates the submission of assignments and the associated review process for essay text as well as computer code. It has been made available under the GNU General Public License as Free and Open Source Software.
IJWMN -Malware Detection in IoT Systems using Machine Learning Techniquesijwmn
Malware detection in IoT environments necessitates robust methodologies. This study introduces a CNN-LSTM hybrid model for IoT malware identification and evaluates its performance against established methods. Leveraging K-fold cross-validation, the proposed approach achieved 95.5% accuracy, surpassing existing methods. The CNN algorithm enabled superior learning model construction, and the LSTM classifier exhibited heightened accuracy in classification. Comparative analysis against prevalent techniques demonstrated the efficacy of the proposed model, highlighting its potential for enhancing IoT security. The study advocates for future exploration of SVMs as alternatives, emphasizes the need for distributed detection strategies, and underscores the importance of predictive analyses for stronger IoT security. This research serves as a platform for developing more resilient security measures in IoT ecosystems.
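The K-fold protocol behind the reported 95.5% figure can be sketched as follows; the fold generator is a plain-Python illustration of the evaluation scheme, not the study's code, and the model itself is omitted:

```python
def k_fold_indices(n: int, k: int = 5):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation:
    each sample is held out exactly once across the k folds."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, test

# The reported accuracy is the mean of the per-fold accuracies.
folds = list(k_fold_indices(10, k=5))
print([test for _, test in folds])  # → [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```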
SECURITY CONSIDERATION IN PEER-TO-PEER NETWORKS WITH A CASE STUDY APPLICATIONIJNSA Journal
The wide adoption of Peer-to-Peer (P2P) overlay networks has also created vast dangers, since millions of users are not conversant with the potential security risks. The lack of centralized control creates great risks for P2P systems, mainly due to the inability to implement proper authentication approaches for threat management. The best available solutions include encryption, utilization of administration, implementing cryptographic protocols, avoiding personal file sharing, and preventing unauthorized downloads. Recently, a new non-DHT-based structured P2P system, built on the Linear Diophantine Equation (LDE) [1], has proven very suitable for designing secure communication protocols. P2P architectures based on this protocol offer simplified methods to integrate symmetric- and asymmetric-cryptography solutions into the architecture without utilizing the Transport Layer Security (TLS) protocol or its predecessor, Secure Sockets Layer (SSL).
A real-time big data sentiment analysis for iraqi tweets using spark streamingjournalBEEI
The scale of data streaming in social networks, such as Twitter, is increasing exponentially. Twitter is one of the most important and suitable big data sources for machine learning research in terms of analysis, prediction, knowledge extraction, and opinion mining. People use the Twitter platform daily to express opinions, a fundamental fact that influences their behavior. In recent years, the flow of Iraqi-dialect content has increased, especially on the Twitter platform. Sentiment analysis and opinion mining for different dialects have become hot topics in data science research. In this paper, we develop a real-time analytic model for sentiment analysis and opinion mining of Iraqi tweets using Spark streaming, and we also create a dataset for researchers in this field. The Twitter handle Bassam AlRawi is the case study here. The new method is well suited to current machine learning applications and fast online prediction.
Automated server-side model for recognition of security vulnerabilities in sc...IJECEIAES
With the increase of global accessibility of web applications, maintaining a reasonable security level for both user data and server resources has become an extremely challenging issue. Therefore, static code analysis systems can help web developers to reduce time and cost. In this paper, a new static analysis model is proposed. This model is designed to discover the security problems in scripting languages. The proposed model is implemented in a prototype SCAT, which is a static code analysis tool. SCAT applies the phases of the proposed model to catch security vulnerabilities in PHP 5.3. Empirical results attest that the proposed prototype is feasible and is able to contribute to the security of real-world web applications. SCAT managed to detect 94% of security vulnerabilities found in the testing benchmarks; this clearly indicates that the proposed model is able to provide an effective solution to complicated web systems by offering benefits of securing private data for users and maintaining web application stability for web applications providers.
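A static rule in the spirit of such scanners can be sketched as a taint pattern over source lines; the rule and PHP snippet below are toy examples, not SCAT's actual rule set:

```python
import re

# Toy rule: flag user-controlled superglobals flowing directly into
# dangerous sinks such as eval() or system().
TAINT_SINK = re.compile(r"\b(eval|system|exec)\s*\(\s*\$_(GET|POST|REQUEST)\b")

def scan_php(source: str) -> list:
    """Return 1-based line numbers where a tainted sink pattern occurs."""
    return [i for i, line in enumerate(source.splitlines(), 1)
            if TAINT_SINK.search(line)]

php = '''<?php
$name = $_GET["name"];
echo htmlspecialchars($name);
eval($_POST["payload"]);
'''
print(scan_php(php))  # → [4]
```

Real static analyzers track taint across assignments and sanitizers (line 2 here is sanitized on line 3, and a mature tool would reason about that flow); a pattern match per line only approximates the first phase of such a model.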
Privacy-preserving data mining is an important issue nowadays, since various organizations and individuals generate sensitive data that they do not want to share, even though that data can be useful for mining. Privacy-preserving mining allows such data to be mined usefully without harming its privacy: privacy can be preserved by applying encryption to the database to be mined, since the data is then secured by the encryption. Code profiling is a field of software engineering where data mining can be applied to discover knowledge useful for future software development. In this work we apply privacy-preserving mining to code-profiling data, such as software metrics of various programs. We compare the accuracy of mining results on actual and encrypted data, analyze the results of privacy-preserving mining on code-profiling data, and report interesting findings.
IP Messenger And File Transfer over Ethernet LANdbpublications
Many forms of communication exist throughout computer technology and have evolved over the past two decades, from instant messaging to shared folders, FTP, telephony, mail transfer, and so on; each serves a different purpose and uses distinct techniques to operate. IP Messenger and File Transfer over Ethernet LAN is a combination of instant messaging and file sharing: a LAN-based network application used within an organization for employees and colleagues to share files among themselves and to text-chat with each other. This application replaces the use of shared folders, e-mail, and USB drives to transfer files, and also replaces instant messaging tools such as Skype and WhatsApp.
Information Upload and retrieval using SP Theory of IntelligenceINFOGAIN PUBLICATION
Cloud computing has become an important aspect of today's technology, and storing data on the cloud is of high importance as the need for virtual space to store massive amounts of data has grown over the years. However, upload and download times are limited by processing time, so this issue must be addressed to handle large data and its processing. Another common problem is deduplication: with cloud services growing at a rapid rate, increasingly large volumes of data are stored on remote cloud servers, but many of the stored files are duplicates because the same file is uploaded by different users at different locations. A recent survey by EMC says about 75% of the digital data present on the cloud consists of duplicate copies. To overcome these two problems, in this paper we use the SP theory of intelligence with lossless compression of information, which makes big data smaller and thus reduces the problems of storing and managing large amounts of data.
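The deduplication-plus-lossless-compression idea can be sketched with content hashing and zlib; this illustrates the storage-saving mechanism only, and does not reproduce the SP-theory compression itself:

```python
import hashlib
import zlib

store = {}  # content hash -> compressed blob

def upload(data: bytes) -> str:
    """Compress losslessly and deduplicate by content hash:
    identical uploads from different users are stored only once."""
    key = hashlib.sha256(data).hexdigest()
    if key not in store:
        store[key] = zlib.compress(data)
    return key

def download(key: str) -> bytes:
    """Lossless decompression recovers the original file exactly."""
    return zlib.decompress(store[key])

doc = b"the same report uploaded by many users " * 100
k1, k2 = upload(doc), upload(doc)
print(k1 == k2, len(store), len(store[k1]) < len(doc))  # → True 1 True
```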
In the following section, we review previous studies and the
structure of PDF documents.
2. Background
2.1. Malware Detection on Stream Data. Malware is a program written to cause an undesirable or harmful effect on a computer system. As the technology of malicious code generation by attackers becomes more intelligent, various studies have been conducted on the detection and analysis of malicious code. Malware can be divided into two categories: executables and nonexecutables. Many security programs exist to detect malicious actions in the form of portable executable files (e.g., Norton, Kaspersky). However, nonexecutables (e.g., malicious actions in PDF documents) can easily bypass some existing security programs, and there is a high risk of false positives. Such document-type malware is known to be more dangerous, as it is often considered insignificant by common users.
In order to detect the malicious documents, many studies
focused on feature extraction based on the PDF structure
analysis, where the features can be seen as a summary of
the logical structure. In [7], a format-independent static
analysis method, namely, a hierarchical document structure
(hidost), was proposed. They adopted a structural multimap,
where structural paths are represented by keys and the leaves
are indicated by values. Several values of the multimap are
reduced to a median value to constitute feature vectors that
are used to train machine-learning algorithms. Cuan et al. [8] defined features using the PDFiD Python script, which checks for objects appearing in PDFs. As a simple trick called the gradient-descent attack may bypass the proposed approach, they proposed two solutions: using a threshold for each feature and reducing the number of noncritical features. Smutz and Stavrou [9] extracted features from document metadata,
such as the number of characters in titles and the size of
images. Li et al. [10] developed a robust and secure feature
extractor called FEPDF, which parses and extracts features
from PDF documents. FEPDF consists of a matching method,
detecting the PDF header, detecting all objects, detecting
cross-reference, and detecting trailer. They emphasized that
FEPDF can identify new malicious PDFs that have not been
identified by existing feature extractors.
Note that these studies commonly require an intensive feature engineering process, because that process largely determines the performance (i.e., accuracy) of malware detection. In this paper, we design a convolutional neural network to tackle malware detection on nonexecutables. Although the proposed network produces results by simply pushing binary sequences into it, without feature engineering, it is still important to investigate the structure of the target data. That is, the neural network captures features automatically, but the features are obtained from input data that must be defined by an expert. Furthermore, understanding the data structure helps in designing better networks. In this paper, we target PDF files because PDF-based attacks are known to be among the major attacks in recent years. The following subsection provides a detailed explanation of the structure of PDFs.
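The way a convolutional layer consumes such a byte sequence can be sketched in miniature; the kernel weights and the byte string below are illustrative, not the paper's trained parameters or architecture:

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution followed by ReLU activation."""
    k = len(kernel)
    return [max(0.0, sum(seq[i + j] * kernel[j] for j in range(k)))
            for i in range(len(seq) - k + 1)]

def global_max_pool(feature_map):
    """Keep only the strongest response of a filter over the sequence."""
    return max(feature_map)

# Bytes scaled to [0, 1]; a learned kernel acts as an n-gram detector
# sliding over the binary stream.
stream = [b / 255.0 for b in b"%PDF-1.7 ... /JS ..."]
kernel = [0.5, -0.25, 0.5]  # illustrative weights, not trained
feature = global_max_pool(conv1d(stream, kernel))
print(round(feature, 3))
```

Stacking many such filters, then feeding the pooled responses to dense layers, yields a classifier that learns its own byte-level features instead of relying on hand-engineered ones.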
2.2. Stream Data of PDF Files. The PDF document consists
of header, body, cross-reference table, and trailer areas. The
header contains document version information, and the
body has objects that contain information about the actual
document. The cross-reference table carries tables used to
refer to objects. The trailer contains the root object and
cross-reference table position information among the objects
in the body region. PDF applications (e.g., PDF
readers) allow execution of JavaScript code within the files,
which means that arbitrary functions can be executed dynamically
through the JavaScript API.
The JavaScript code is contained in a stream, one of the
PDF object types, where the object types include Boolean
values, arrays, dictionaries, streams, and indirect objects. The
stream is a collection of consecutive bytes of binary data that
varies in length. A stream normally contains large data such
as image files or page composition objects. A sample stream
object from a benign PDF file is shown in Figure 1. Its
object number is 141, and its length is 1,392 bytes. This stream
can be decoded with the FlateDecode filter; the result is depicted
in Figure 2. The function in this sample looks normal, but
users would clearly be in danger if malicious actions were
embedded in such a stream.
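Since the FlateDecode filter is plain zlib/deflate compression, a stream body can be decoded for inspection with Python's standard zlib module. A minimal sketch (the stream content below is a hypothetical stand-in, not the object from Figure 1):

```python
import zlib

# Hypothetical stream body: in a real PDF this is the binary data
# between the 'stream' and 'endstream' keywords of the object.
raw = zlib.compress(b"function normalFunc() { return 1; }")

# FlateDecode is plain zlib/deflate, so zlib.decompress recovers
# the original content for inspection.
decoded = zlib.decompress(raw)
print(decoded)
```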
In Figure 3, an example of a stream object obtained from
a malicious PDF document is shown. The object number is
17 and its length is 1,279 bytes. The decoded beginning part
of the malicious JavaScript code is shown in Figure 4, where
it uses the ‘unescape’ function, which decodes a string
encoded with the ‘escape’ function. These two functions are
intentionally employed to obfuscate the code, thereby
making the malicious actions difficult to detect.
Figures 5 and 6 show another example of a malicious
PDF stream object. The JavaScript code in Figure 6 calls
the ‘replace’ function which uses regular expressions to
replace certain characters with a desired string. As shown so
far, malicious JavaScript code often appears obfuscated
by encoding, and certain functions (e.g., replace, unescape)
trigger the malicious actions.
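The effect of the two obfuscation patterns above can be reproduced in a few lines of Python; `urllib.parse.unquote` behaves like JavaScript's `unescape` for `%XX` escape sequences. The payloads below are hypothetical illustrations, not the code from the figures:

```python
from urllib.parse import unquote

# Pattern 1: percent-encoded payload hidden behind unescape().
encoded = "%65%76%61%6c"          # hex codes for the letters e, v, a, l
assert unquote(encoded) == "eval"

# Pattern 2: junk characters stripped at run time with replace().
obfuscated = "e!v!a!l"
assert obfuscated.replace("!", "") == "eval"
```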
There have been studies, of course, that focused on the
JavaScript codes in the PDFs. Khitan et al. [12] defined
features based on functions, constants, objects, methods,
and keywords as well as lexical properties of JavaScript
codes. Zhang [13] used features such as the number of
objects, number of pages, and stream filtering information
of JavaScript codes, together with the PDF structure infor-
mation, entity characteristics, metadata information, and
content statistics attributes. Liu et al. [1] proposed a context-
aware approach based on the observation that the malicious
JavaScript works differently from the normal JavaScript. This
approach monitors suspicious activity at the level of JavaScript
statements by passing the original code as an argument to
the ‘eval’ function and then opening the PDF document. The ‘eval’
function executes the delivered JavaScript code
and returns a result. That is, the code is executed in a separate
environment to find suspicious code that runs dropped
malware.
In this study, we do not perform lexical analysis on
JavaScript code or run PDF documents for dynamic analysis
as in previous studies. We design a convolutional neural
Security and Communication Networks
Figure 1: Example of a stream object in a positive PDF file.
Figure 2: Example of a decoded stream content of a positive PDF
file.
network that takes a byte sequence of a stream as an input
and predicts whether the input sequence contains malicious
actions or not. In the next subsection, we briefly review
previous studies that adopt neural networks to tackle the
malware detection task.
2.3. Neural Networks for Malware Detection. A few studies
thus far have applied neural networks to malicious
software (malware) detection. Most recent works among
them have used features extracted through dynamic analysis,
that is, features collected while the binary runs in a
virtualized environment. Kolosnjaji et al. [14] proposed a
combination of convolutions and long short-term memory
(LSTM) [15] to classify malware types based on the features
of the API call sequences. Huang and Stokes [16] manually
defined 114 high-level features from API calls as well
as original function calls to predict malware types. This
approach essentially consists of two models: malware
detection and malware type classification. The authors argued
that the shared parameters of the two models contribute to
improving the overall performance. These dynamic-analysis
studies were performed in certain nonpublic emulation
environments, which makes them difficult to reproduce.
The other line of malware detection is the static analysis,
in which features are obtained from the files without running
them. Raff et al. [17] applied neural networks to the malware
identification problem using features in the form of 300
raw bytes from the PE header of each file. This work showed
Figure 3: Example of a stream object in a malicious PDF file.
Figure 4: Example of a decoded stream content of a malicious PDF
file.
Figure 5: Another example of a stream object in a malicious PDF
file.
Figure 6: Another example of a decoded stream content of a
malicious PDF file.
that the neural networks are capable of extracting underlying
high-level interpretation from the raw bytes, which in turn
makes it possible to develop malware detectors without hand-
crafted features. Saxe and Berlin [18] utilized a histogram of
byte entropy values from the entire files and defined a fixed
length feature vector as the input of the neural networks.
Le et al. [19] designed a CNN-BiLSTM architecture, where the
rationale for placing a bidirectional LSTM (BiLSTM) layer on
top of the CNN layer is that the BiLSTM layer may interpret
sequential dependencies between different pieces generated
by the CNN layer. They showed that the CNN layer is effective
in representing local patterns of fixed length, and the BiLSTM
has a potential to capture arbitrary sequential dependencies
of executables. Raff et al. [20] derived a feature vector
from the raw bytes and designed a shallow convolutional
neural network with a gated convolutional layer, a global
max-pooling layer, and a fully connected output layer. They
claimed that their work was the first to define a feature vector
from the entire binary and that it was hard to develop deeper
networks due to the extraordinarily long byte strings (e.g.,
1-2M in length). Their biggest contribution is that the feature
vector is obtained from the entire binary, so that it may grasp
the global context of the entire binary. That is, the contents
of a binary may have the high amount of positional variation
or can be rearranged in arbitrary ordering, so they adopted
the global-level feature vectors with a very large dimension.
Although their network is designed to scale with binary
strings of variable length, it is essentially not applicable to
even longer binary strings (e.g., 3-4M in length). All of these
studies commonly utilized raw bytes of
executables.
The proposed network in this paper is designed to take
a byte sequence within a nonexecutable as an input
and to generate an output based on high-level patterns of
collectable spatial clues, which implies that the proposed
network is applicable to byte sequences of variable length.
In the following section, we illustrate how we design the
proposed network according to the characteristics of the
input data (i.e., byte sequences).
3. Proposed Method
To detect malicious actions without heavy feature engineering,
we designed a deep learning model for stream objects
that discriminates maliciousness at the object level. Stream
objects have no size limitation, and one part of a stream
may exhibit malicious actions while the rest does not. Such
positional variation of malicious content makes it difficult to
detect the maliciousness of an object. Among deep learning
models, convolutional neural networks (CNN) are known to
be successful at detecting locally contextual patterns. CNN
models have brought dramatic performance advances in
image processing [21, 22], and one of their benefits is that they
work with a smaller amount of data compared to other deep
learning models (e.g., recurrent neural networks, fully
connected neural networks).
A graphical representation of our proposed network is
depicted in Figure 7, where the network consists of an
embedding layer, two convolutional layers, a pooling layer, a
fully connected layer, and an output layer. Note that the figure
shows only the structure of the network; the actual dimensions
of the layers and the numbers of channels are much larger.
E denotes the embedding size and S the sequence
length. Rather than directly feeding the raw byte values
into the convolution (i.e., using a scaled version of a byte’s value
from 0 to 255), we adopt an embedding layer that maps each byte
to an E-dimensional feature vector, because the byte values do
not imply intensity but convey some contextual information.
That is, given a byte sequence of length S, the E × S real-valued
embedding matrix is computed during the training process,
so that the matrix helps to grasp a wider breadth of input
patterns.
The embedding layer makes it possible to represent the
meaning of each byte by learning from all the byte sequences.
As discussed in [20], the raw values of byte sequences do
not simply represent intensity, and it will be better to find an
alternative way to see the values. For example, a byte value
160 does not imply ‘better’ or ‘stronger’ intensity than 130,
but the two values must convey different meaning. In the
‘word embedding’ concept in the natural language processing
(NLP) field, similar words (e.g., ‘hi’ and ‘hello’) are close
to each other in the embedding space, whereas opposite
words are far from each other. Likewise, our embedding
layer interprets the contextual meaning of byte values and
represents them on the embedding space.
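Conceptually, the embedding layer is a learned 256 × E lookup table indexed by byte value. A NumPy sketch, with random weights standing in for the trained matrix:

```python
import numpy as np

E = 25                                    # embedding size used in the paper
rng = np.random.default_rng(0)
table = rng.normal(size=(256, E))         # one E-dim vector per byte value

byte_seq = [0x25, 0x50, 0x44, 0x46]       # the raw bytes of "%PDF"
vectors = table[byte_seq]                 # embedding lookup: one row per byte

# The lookup yields one E-dimensional vector per input byte.
print(vectors.shape)                      # (4, 25)
```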
Several locally adjacent 𝐸-dimensional vectors gener-
ated from the embedding layer are then fed into the first
convolutional layer followed by another convolutional layer.
The first convolutional layer is designed to take a C1 × E
matrix which is supposed to carry spatial clues of malicious
actions, in the hope that the collected clues will be enough
for the entire network to make a wise decision. In the
recent work [20], a convolutional neural network designed
for analysing the entire sequence at once was proposed, but
this network is not applicable to longer sequences. In
contrast, our network collects simple local clues and generates
a high-level representation, which means that the proposed
network is applicable to sequences of variable length.
Each convolutional layer takes one or more adjacent vectors
(or values) from its previous layer as an input and generates an
output value by a summation of element-wise multiplication
with a filter. This filter, also known as a kernel, is computed
during the training process.
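The multiply-and-sum operation just described is a dot product between one window of the input and the kernel; a toy numeric example (the values are hypothetical):

```python
import numpy as np

window = np.array([[1.0, 2.0],
                   [3.0, 4.0]])     # one C1 x E slice of the embedded input
kernel = np.array([[0.5, 0.0],
                   [0.0, 0.5]])     # a learned filter of the same shape

# One output value of the convolution at this position:
# the sum of the element-wise products of window and kernel.
out = float(np.sum(window * kernel))
print(out)                          # 2.5
```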
The two convolutional layers have K1 and K2 distinct
channels (i.e., kernels), respectively, to find different spatial
patterns. The convolutional layer slides, or convolves, from
the top-left corner to the bottom-right corner with an
arbitrary stride, so that the resulting matrix or vector carries a
compilation of spatial patterns throughout the input. Similar
to many studies related to CNN [2, 23, 24], our network has
consecutive convolutional layers because multiple stacked
convolutional layers are known to have better (higher-level)
representation than a single convolutional layer. We have
two convolutional layers stacked along the network depth,
where the first convolutional layer takes a C1 × E matrix as
an input and the second convolutional layer takes a C2 × 1
vector. The first layer is supposed
to catch simple clues, which will be used for the second
Figure 7: Graphical representation of the proposed convolutional neural network.
layer to understand underlying higher-level patterns. For
example, the first layer may capture simple clues of JavaScript
statements (e.g., replace), while the second layer represents
contextual information about their types or arguments.
One may argue that stacking more convolutional layers
might be better, because the deeper network is often known
to be capable of extracting more complicated patterns. This
might be true, but it should be noted that the deeper network
is not always better than a shallow one. The depth
of the network must be chosen carefully by investigating the
data; a too-complicated network will probably overfit, whereas
a too-simple network will underfit. By examining the
sequence data, we concluded that two convolutional layers
are enough to capture the variety of the spatial patterns of
malicious actions. We also, of course, show that this structure
is indeed better than deeper networks through experiments.
The high-level representation obtained from the two
consecutive convolutional layers is submitted to the pooling
layer that helps to focus on some representative or primary
patterns. Among several types of pooling layer such as
average-pooling and L2-norm pooling, we adopt the max-
pooling layer as it is generally known to be effective in various
tasks. Similar to the convolutional layer, the pooling layer
slides from the top-left corner to the bottom-right corner
with an arbitrary stride, resulting in output vectors of much
smaller size. The output vectors are flattened or concatenated
to form a one-dimensional vector that will be passed to the 𝐹-
dimensional fully connected (FC) layer. The FC layer collects
the primary patterns from the pooled values, and the final
output layer represents how likely it is that the given byte
sequence embeds malicious actions.
4. Experiments
4.1. Dataset. We collected PDF files from various antivirus
or antimalware companies and constructed a dataset through
manual labelling process. A malicious PDF file often contains
several stream objects, and one or more of those streams
contain malicious actions. As we observed that all malicious
streams contain JavaScript code, we mark an object
as ‘malicious’ if it contains JavaScript code and as ‘benign’
otherwise. The objects of benign PDF files are all marked as
‘benign’. Note that we do not use the presence of
JavaScript itself as a feature at all. The proposed network will
Table 1: Data statistics, where the positive and negative percentages
imply the proportions of the malicious streams and the benign
streams, respectively.
Size (# of instances) 1,978
Positive:Negative (%) 50.0:50.0
Length (dimension) 1,000
see only the raw byte sequence. The object streams are
hexadecimal representations of text streams, as shown in
Figure 8, where each line of the figure is a part of the
conversion of Figures 2, 4, and 6, respectively.
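Reading a hex-dumped stream into the network's input format then amounts to parsing it as a sequence of integers in [0, 255]; a minimal sketch with a hypothetical fragment:

```python
# A hypothetical fragment of a hex-dumped stream, as in Figure 8:
# these are the ASCII bytes of the text "JavaScript ".
hex_dump = "4a61766153637269707420"

byte_seq = list(bytes.fromhex(hex_dump))  # integers in [0, 255]
print(byte_seq[:4])                       # [74, 97, 118, 97]
```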
The data statistics are summarized in Table 1, where each
data instance is defined as a byte-level stream object. As
shown in Figure 9, for each data file, multiple byte streams are
collected, each of which is manually labelled with a positive
(malicious) or a negative (benign) tag. The total number of
originally collected byte sequences was 4,371, but we randomly
downsampled the negative instances to 989 to balance the classes.
Note that the byte streams have different lengths, as
shown in Figure 10, but the proposed network cannot take
sequences of variable length as input. As the network is
designed to grasp high-level patterns from collectable spatial
clues of malicious actions, it is not necessary to use the whole
sequence of a different length at a time. Rather, we padded
short sequences and truncated long ones, so that all
sequences have the same length of
1,000. When the network is trained with these fixed-length
sequences, it can be applied to divided byte sequences with
the same size in a target file, so that it will predict whether
the target file contains malware or not.
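The fixed-length preprocessing described above can be sketched as follows (the paper does not state the padding value, so zero-padding is assumed here):

```python
def fix_length(seq, target=1000, pad=0):
    """Truncate `seq` to `target` values, or pad it with `pad` up to `target`."""
    return seq[:target] + [pad] * max(0, target - len(seq))

short = [0x4A, 0x53]                 # a 2-byte stream
long_ = list(range(256)) * 5         # a 1,280-value stream

assert len(fix_length(short)) == 1000
assert fix_length(long_) == long_[:1000]
```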
4.2. Model. We compared the proposed network with several
representative classification methods such as support vector
machine [25], decision tree, naïve Bayes, and random forest
[26]. Brief descriptions and parameter settings of the
methods are summarized in Table 2.
We also examined various settings for our network, and
the best setting was found as follows: (1) the embedding
dimension E = 25, (2) C1 and C2 are both set to 3, (3)
pooling layer dimension P = 100, (4) K1 and K2 are 32 and
64, respectively, and (5) fully connected layer dimension F =
128, where the notations can be found in Figure 7. The two
Figure 8: Byte streams.
Table 2: Description of comparative classifiers and parameter settings.
Model Description
Naïve Bayes (NB) (i) Probabilistic classifier based on Bayes’ theorem
(ii) Assumes the independence between features
Decision tree (DT) (i) C4.5 classifier using J48 algorithm
(ii) Confidence factor for pruning: 0.25
Support vector machine (SVM)
(i) Non-probabilistic binary classification model that finds a decision boundary with a maximum
distance between two classes
(ii) Kernel: Poly
(iii) Exponent=1.0, Complexity c=1.0
(iv) Trained via sequential minimal optimization algorithm [11]
Random forest (RF)
(i) Ensemble model that generates its final result by incorporating the results of multiple
decision trees
(ii) #trees=100
(iii) #features=log(#trees)+1
(iv) Each tree has no depth-limitation
Figure 9: Positive or negative tags for each byte stream in the data
files.
convolutional layers have strides (1, 25) and (1, 1), respectively,
and the pooling layer has a nonoverlapping stride (100, 1). The
sequence length S is fixed at 1,000. Every layer takes the rectified
linear unit (ReLU) as an activation function except for the
output layer that takes a softmax function. The cost function
is defined as a cross entropy over all nodes of the output layer.
Consequently, the total number of trainable parameters is
89,371.
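Under the stated setting, the forward pass can be sketched shape-by-shape in NumPy. The weights below are random stand-ins; biases and batch normalization are omitted for brevity, so the trainable-parameter count of this sketch does not exactly match the reported 89,371:

```python
import numpy as np

rng = np.random.default_rng(0)
S, E, C1, C2, K1, K2, P, F = 1000, 25, 3, 3, 32, 64, 100, 128

# Random stand-ins for the trained weights (correct shapes only).
emb = rng.normal(size=(256, E))                   # byte embedding table
w1 = rng.normal(size=(K1, C1, E)) * 0.01          # conv-1: K1 filters of C1 x E
w2 = rng.normal(size=(K2, C2, K1)) * 0.01         # conv-2: K2 filters over K1 channels
flat = (S - C1 + 1 - C2 + 1) // P * K2            # size after pooling + flattening
wf = rng.normal(size=(flat, F)) * 0.01            # fully connected layer
wo = rng.normal(size=(F, 2)) * 0.01               # output layer (benign/malicious)

relu = lambda x: np.maximum(x, 0.0)

def forward(byte_seq):
    x = emb[byte_seq]                                          # (S, E)
    h1 = relu(np.stack([[np.sum(x[i:i + C1] * w1[k])           # conv-1, stride 1
                         for i in range(S - C1 + 1)]
                        for k in range(K1)], axis=1))          # (S-C1+1, K1)
    h2 = relu(np.stack([[np.sum(h1[i:i + C2] * w2[k])          # conv-2, stride 1
                         for i in range(h1.shape[0] - C2 + 1)]
                        for k in range(K2)], axis=1))          # (S-C1-C2+2, K2)
    n = h2.shape[0] // P
    pooled = h2[:n * P].reshape(n, P, K2).max(axis=1)          # non-overlapping max-pool
    z = relu(pooled.reshape(-1) @ wf)                          # fully connected
    logits = z @ wo
    e = np.exp(logits - logits.max())
    return e / e.sum()                                         # softmax over 2 classes

probs = forward(rng.integers(0, 256, size=S))
print(probs.shape)        # (2,)
```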
Figure 10: Frequencies of clean and malware PDF byte streams with
different lengths, where the x-axis represents stream lengths and the
y-axis shows frequencies.
We adopt the Adam optimizer [27] with an initial learning
rate of 0.001 to train our network. Our training recipe is as
follows: (1) L2 gradient-clipping at 0.5, (2) drop-out [28]
with a keep probability of 0.25 for the fully connected layer,
and (3) batch normalization [29] for the two convolutional
layers. We tried regularization methods such as L2
regularization and DeCov [30] but observed no performance
improvements, as batch normalization and drop-out
are known to have a regularization effect themselves. The
weight matrices of the convolutional layers, the FC layer,
and the output layer are initialized by He’s algorithm [31],
and bias vectors are initialized with zeros. In the following
Table 3: Experimental results with the PDF dataset, where the two values of each cell are of ‘benign’ and ‘malicious’, respectively.
Model Precision Recall F1
DT 96.00 / 90.30 89.70 / 96.30 92.70 / 93.20
NB 88.40 / 99.70 99.70 / 87.00 93.70 / 92.90
SVM 94.70 / 98.90 99.00 / 94.40 96.80 / 96.60
RF 93.50 / 99.40 99.40 / 93.10 96.40 / 96.10
Emb+Conv+Conv+Pool+FC 99.76 / 100.0 97.37 / 97.37 98.48 / 98.65
Conv+Conv+Pool+FC 99.78 / 100.0 92.62 / 97.27 95.71 / 98.61
Emb+Conv+Pool+FC 99.73 / 100.0 94.94 / 97.78 97.12 / 98.87
Emb+Conv+Conv+FC 99.67 / 100.0 97.78 / 92.32 98.55 / 96.00
Emb+Conv+Conv+Conv+Pool+FC 99.70 / 100.0 92.21 / 95.35 95.36 / 97.60
subsection, we demonstrate the performance of our network
by experimental comparison with the other classifiers and
different networks.
4.3. Result. One unfortunate aspect of the malware detection
field is that, for various reasons, no public dataset
is available. Publicly obtainable datasets are often
of insufficient quality, so previous studies could not
compare performance (i.e., accuracy) across works because
of differing data characteristics and labelling procedures. For
the same reason, it is inevitably hard to compare our
performance with other state-of-the-art studies. We compare our
network with some comparative machine-learning methods
and CNN models with different settings. The measurement
is conducted using 10-fold cross validation, where the perfor-
mance values (i.e., F1 score, precision, and recall) are averaged
values of three distinct trials.
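The per-class scores reported below follow the standard definitions of precision, recall, and F1; a pure-Python sketch with hypothetical counts:

```python
def prf(tp, fp, fn):
    """Per-class precision, recall, and F1, in percent."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * v, 2) for v in (precision, recall, f1))

# Hypothetical counts for one class: 95 true positives,
# 5 false positives, 5 false negatives.
print(prf(95, 5, 5))      # (95.0, 95.0, 95.0)
```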
The experimental results are described in Table 3,
where the two values for each cell correspond to the
‘benign’ and ‘malicious’ classes, respectively. For example,
the F1 scores of random forest (RF) are 96.4 for ‘benign’
and 96.1 for ‘malicious’. For the four traditional machine-learning
models (i.e., DT, NB, SVM, and RF), the values
of the input sequence are treated as nominal values. We
tested five different structures of CNNs. The first network,
Emb+Conv+Conv+Pool+FC, is the best structure which has
an embedding layer, two consecutive convolutional layers,
a pooling layer, and a fully connected layer followed by an
output layer. The number of epochs differs across networks
according to their complexity (i.e., the number of layers and
parameters) and the dataset size. For instance,
the first network is trained for 30 epochs, whereas
training the second network requires 25 epochs.
Among the traditional machine-learning models, the
support vector machine (SVM) exhibits the best F1 scores, and
random forest (RF) shows comparable results. As also shown
in Table 3, it is obvious that the five CNNs outperform
the traditional machine-learning models. From the results
of the five networks, we can draw two main observations.
First, the embedding layer (Emb) seems to play a significant
role in better representation of byte sequences, as the second
network without the embedding layer exhibits much worse
F1 scores than the other networks. Second, the stacked
convolutional layers enable interpreting high-level patterns,
and the optimal number of layers seems to be two. The third
network, with a single convolutional layer, and the fifth
network, with three convolutional layers, are both worse than the
first network. In particular, the fifth network has the worst F1
score among the five networks. This indicates that more stacked
layers are not always better than shallower networks.
4.4. Discussion. The experimental results can be summarized
in two aspects. First, the convolutional neural networks
showed superior performance compared to the traditional machine-learning
models. The F1 score of the proposed network is
almost 2% higher than that of the SVM, which suggests that
the convolutional neural networks have a better capacity
to analyse the underlying spatial patterns of the
byte sequences. Second, it seems that the embedding layer
followed by the two convolutional layers is best suited to
representing high-level patterns of malicious actions. Fewer or
more stacked convolutional layers gave worse results.
Beyond these two aspects, we need to discuss
the parameter settings for training. The results in Table 3
are obtained from the network trained with the dimensions
and parameters given in the Model subsection. We
found the optimal parameter setting by grid search, and some
remarkable results with different settings and dimensions are
summarized in Table 4. The first two rows correspond to
the embedding size 𝐸, and the last two rows are associated
with the pooling size 𝑃. Other remaining rows are related to
the training recipe, such as drop-out, gradient-clipping, and
batch normalization. The drop-out together with the batch
normalization was helpful to generalize the network, and
we observed no improvements with additional regularization
methods. The gradient-clipping made the training more
robust, as it prevented the optimizer from skipping over desirable
points of the gradient space.
We also checked the training time of the comparable
methods. Table 5 shows the elapsed seconds for training dif-
ferent methods. The training of the four traditional methods
is performed on a machine of Xeon E5-2620 V4 with 128GB
RAM, while the CNN models are trained using a machine
of i7-9700K with 64GB RAM and two RTX 2080 Ti. The
training time of the CNN models strongly depends on the
performance of GPUs. All methods are trained using the 1,978
instances, and the number of epochs is equally 30 for the
CNN models. Note that the fourth CNN model without a
Table 4: Experimental results with different settings, where the two values of each cell are of ‘benign’ and ‘malicious’, respectively.
Dimensions & Training options Precision Recall F1
E = 15 99.68 / 100.0 92.52 / 95.75 95.34 / 97.82
E = 35 99.73 / 100.0 94.03 / 97.47 96.58 / 98.71
Drop-out with 0.5 99.70 / 100.0 93.53 / 97.37 96.23 / 98.66
L2 gradient-clipping with 0.3 99.74 / 100.0 95.25 / 97.17 97.17 / 98.55
L2 gradient-clipping with 0.7 99.70 / 100.0 92.11 / 97.78 95.14 / 98.86
w/o batch-normalization 99.70 / 100.0 95.96 / 97.27 97.68 / 98.61
P = 50 99.70 / 100.0 95.05 / 97.07 97.13 / 98.50
P = 150 99.76 / 100.0 93.83 / 97.88 96.29 / 98.92
Table 5: Training time of comparable methods.
Model Time (seconds)
DT 0.69
NB 0.16
SVM 750.88
RF 1.56
Emb+Conv+Conv+Pool+FC 22.07
Conv+Conv+Pool+FC 21.12
Emb+Conv+Pool+FC 13.85
Emb+Conv+Conv+FC 31.52
Emb+Conv+Conv+Conv+Pool+FC 23.16
pooling layer takes a much longer time than the other four
CNN models. The reason is that a pooling layer reduces the size
of the representation passed onward, so the fourth CNN model,
which lacks one, has a
greater number of parameters to train. Except for the SVM,
the traditional models seem computationally more efficient
than the CNN models. One of the reasons for the slow
SVM training must be the use of the Poly kernel function.
We could, of course, use another kernel function (e.g., a linear kernel),
but it would probably degrade accuracy. As the training
of random forest (RF) is much faster than that of the CNN models,
the RF model might be preferred if we want
efficiency at a small loss of effectiveness (i.e., accuracy).
5. Conclusion
The threat of malicious documents keeps growing, because
it is difficult to detect the malicious actions within the
documents. In this work, we proposed a new convolutional neural
network that takes a byte sequence of a nonexecutable
as an input and predicts whether the given sequence contains
malicious actions. We illustrated how we designed the
network according to the characteristics of the input data
and provided discussion about the experiments using the
manually labelled dataset. The experimental results showed
that the proposed network outperforms several represen-
tative machine-learning models as well as other convolu-
tional neural networks with different settings. Though we
conducted experiments only with PDF files, we expect
that this approach is applicable to other types of data
if they contain byte streams. Therefore, as a future work, we
will collect data of other file types (e.g., .rtf files) and perform
further investigation.
Data Availability
We disclose our dataset as well as our code to the public
(https://sites.google.com/view/datasets-for-public).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Soonchunhyang University
Research Fund (no. 20170265). This work was supported
by Basic Science Research Program through the National
Research Foundation of Korea (NRF) funded by the Ministry
of Education (NRF-2017R1D1A3B03036050).
References
[1] D. Liu, H. Wang, and A. Stavrou, “Detecting malicious
JavaScript in pdf through document instrumentation,” in Pro-
ceedings of the 44th Annual IEEE/IFIP International Confer-
ence on Dependable Systems and Networks, pp. 100–111, IEEE,
Atlanta, GA, USA, June 2014.
[2] T. T. Um, F. M. J. Pfister, D. Pichler et al., “Data augmentation of
wearable sensor data for Parkinson’s disease monitoring using
convolutional neural networks,” in Proceedings of the 19th ACM
International Conference on Multimodal Interaction, pp. 216–
220, ACM, Glasgow, Scotland, 2017.
[3] Z. M. Kim, Y. S. Jeong, H. R. Oh et al., “Investigating the impact
of possession-way of a smartphone on action recognition,”
Sensors, vol. 16, no. 6, pp. 1–5, 2016.
[4] Y. Kim, “Convolutional neural networks for sentence classifica-
tion,” in Proceedings of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing, pp. 1746–1751, Association
for Computational Linguistics, Doha, Qatar, October 2014.
[5] A. Hannun, C. Case, and J. Casper, “Deep speech: scaling up
end-to-end speech recognition,” Computing Research Reposi-
tory, pp. 1–12, 2014.
[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only
look once: Unified, real-time object detection,” in Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 779–788, IEEE, Las Vegas Valley, NV, USA, July
2016.
[7] N. Šrndić and P. Laskov, “Hidost: a static machine-learning-
based detector of malicious files,” EURASIP Journal on Infor-
mation Security, vol. 2016, no. 1, p. 22, 2016.
[8] B. Cuan, A. Damien, C. Delaplace et al., Malware Detection
in PDF Files Using Machine Learning [PhD. Thesis], REDOCS,
2018.
[9] C. Smutz and A. Stavrou, “Malicious PDF detection using
metadata and structural features,” in Proceedings of the 28th
Annual Computer Security Applications Conference, pp. 239–
248, Orlando, Fla, USA, December 2012.
[10] M. Li, Y. Liu, M. Yu et al., “FEPDF: a robust feature extractor
for malicious PDF detection,” in Proceedings of the 2017 IEEE
Trustcom/BigDataSE/ICESS, pp. 218–224, IEEE, Sydney, Aus-
tralia, August 2017.
[11] J. C. Platt, “Sequential minimal optimization: a fast algorithm
for training support vector machines,” in Advances in Kernel
Methods – Support Vector Learning, MIT Press, Cambridge,
Mass, USA, 1998.
[12] S. J. Khitan, A. Hadi, and J. Atoum, “PDF forensic analysis
system using YARA,” International Journal of Computer Science
and Network Security, vol. 17, no. 5, pp. 77–85, 2017.
[13] J. Zhang, “MLPdf: an effective machine learning based approach
for PDF malware detection,” Security and Cryptography, 2018.
[14] B. Kolosnjaji, A. Zarras, G. Webster et al., “Deep learning for
classification of malware system call sequences,” Lecture Notes
in Computer Science, vol. 9992, pp. 137–149, 2016.
[15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] W. Huang and J. W. Stokes, “MtNet: a multi-task neural network
for dynamic malware classification,” Lecture Notes in Computer
Science, vol. 9721, pp. 399–418, 2016.
[17] E. Raff, J. Sylvester, and C. Nicholas, “Learning the PE header,
malware detection with minimal domain knowledge,” in Pro-
ceedings of the 10th ACM Workshop on Artificial Intelligence and
Security, pp. 121–132, ACM, Dallas, TX, USA, 2017.
[18] J. Saxe and K. Berlin, “Deep neural network based malware
detection using two dimensional binary program features,” in
Proceedings of the 10th International Conference on Malicious
and Unwanted Software, pp. 11–20, IEEE, Fajardo, PR, USA,
October 2015.
[19] Q. Le, O. Boydell, B. Mac Namee, and M. Scanlon, “Deep
learning at the shallow end: Malware classification for non-
domain experts,” Digital Investigation, vol. 26, pp. S118–S126,
2018.
[20] E. Raff, J. B. Barker, J. Sylvester et al., “Malware detection by
eating a whole EXE,” in Proceedings of the Workshops of the
Thirty-Second AAAI Conference on Artificial Intelligence, pp.
268–276, New Orleans, LA, USA, 2018.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proceedings of the
IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet
classification with deep convolutional neural networks,” in
Proceedings of the Advances in Neural Information Processing
Systems 25, Lake Tahoe, NV, USA, December 2012.
[23] C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolu-
tions,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 1–9, IEEE, Boston, MA, USA, June
2015.
[24] G. Huang, Z. Liu, L. van der Maaten et al., “Densely connected
convolutional networks,” in Proceedings of the 30th IEEE
Conference on Computer Vision and Pattern Recognition, pp.
2261–2269, IEEE, Honolulu, HI, USA, July 2017.
[25] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “Training algorithm
for optimal margin classifiers,” in Proceedings of the 5th Annual
ACM Workshop on Computational Learning Theory, pp. 144–
152, ACM, Pittsburgh, PA, USA, July 1992.
[26] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1,
pp. 5–32, 2001.
[27] D. P. Kingma and J. L. Ba, “Adam: a method for stochastic
optimization,” in Proceedings of the 3rd International
Conference for Learning Representations, San Diego, CA, USA,
2015.
[28] N. Srivastava, G. Hinton, A. Krizhevsky et al., “Dropout:
a simple way to prevent neural networks from overfitting,”
Journal of Machine Learning Research, vol. 15, pp. 1929–1958,
2014.
[29] S. Ioffe and C. Szegedy, “Batch normalization: accelerating
deep network training by reducing internal covariate shift,” in
Proceedings of the 32nd International Conference on Machine
Learning, pp. 448–456, ACM, Lille, France, July 2015.
[30] M. Cogswell, F. Ahmed, R. Girshick et al., “Reducing over-
fitting in deep networks by decorrelating representations,” in
Proceedings of the 4th International Conference for Learning
Representations, San Juan, PR, USA, 2016.
[31] K. He, X. Zhang, S. Ren et al., “Delving deep into rectifiers:
surpassing human-level performance on ImageNet classification,”
in Proceedings of the 15th IEEE International Conference
on Computer Vision, pp. 1026–1034, Santiago, Chile, December
2015.