This document describes a study that uses data mining techniques to detect malware. The study involved extracting opcode frequencies from 300 malware samples and 150 benign software samples. The opcode data was analyzed using the WEKA machine learning tool to generate rules for classifying software as malware or benign. Through a recursive process of removing the top predictive opcode and re-analyzing the data, the study identified a set of opcodes that predicted malware versus benign software with 96% accuracy. Testing the rules against noise added to the data showed the classification remained over 91% accurate, demonstrating the robustness of the approach. The document outlines the full methodology used in the study.
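A rule of the kind the study derives can be sketched in a few lines of Python. The opcode names, the threshold, and the single-opcode rule below are illustrative assumptions, not values reported by the study:

```python
from collections import Counter

def opcode_frequencies(opcodes):
    """Relative frequency of each opcode in a disassembled sample."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

def classify(sample_opcodes, rule_opcode, threshold):
    """Apply one learned rule: flag as malware when the frequency of
    `rule_opcode` exceeds `threshold` (hypothetical rule form)."""
    freqs = opcode_frequencies(sample_opcodes)
    return "malware" if freqs.get(rule_opcode, 0.0) > threshold else "benign"

sample = ["mov", "mov", "call", "jmp", "mov", "xor"]
print(classify(sample, "mov", 0.4))  # mov frequency 0.5 > 0.4 -> "malware"
```

In the study's recursive variant, the top predictive opcode would be removed from every sample's opcode list and the rule induction repeated on the remainder.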
IRJET - Virtual Data Auditing at Overcast Environment (IRJET Journal)
This document discusses a proposed scheme for remote data integrity checking of files stored in the cloud while hiding sensitive confidential information. The scheme works as follows:
1) A user hides confidential data blocks in an original file and generates signatures before uploading the blinded file to a sanitizer.
2) The sanitizer cleans up the blinded data blocks to remove confidential information and converts the signatures to be valid for the cleaned file.
3) The cleaned file and signatures are then uploaded to the cloud. A third party auditor can remotely check the integrity of the stored data while the confidential information is protected. The scheme supports data sharing in the cloud while maintaining privacy and allowing for integrity audits.
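The three steps above can be sketched as follows. The XOR blinding and hash-based "signatures" are simplifications for illustration only; the actual scheme uses sanitizable signatures with stronger properties:

```python
import hashlib

def blind(block, mask):
    """Blind a confidential block by XOR with a mask (simplified
    stand-in for the paper's blinding step)."""
    return bytes(b ^ m for b, m in zip(block, mask))

def sign(block, key=b"user-key"):
    """Toy 'signature': a keyed hash of the block. The real scheme
    uses sanitizable signatures, not keyed hashes."""
    return hashlib.sha256(key + block).hexdigest()

def audit(blocks, signatures, key=b"user-key"):
    """Third-party integrity check: recompute and compare signatures."""
    return all(sign(b, key) == s for b, s in zip(blocks, signatures))

blocks = [b"public data", blind(b"secret!!", b"\x5a" * 8)]
sigs = [sign(b) for b in blocks]
print(audit(blocks, sigs))  # True: stored data is intact
```

Tampering with any block makes `audit` fail, which is the property the third-party auditor relies on.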
This document is a report on cyber and digital forensics submitted by three students from G.H. Raisoni College of Engineering in Nagpur, India. The report discusses digital forensic methodology, tools used in digital analysis like Backtrack and Nuix, techniques such as live analysis and analyzing deleted files, analyzing USB device history from the Windows registry, and concludes that digital forensics is an evolving field with no set standards yet and constant updates are needed to investigate modern cyber crimes.
Enhancing the Security for Clinical Document Architecture Generating System u... (IRJET Journal)
The document proposes enhancing security for clinical document architecture (CDA) generation systems using AES encryption with an artificial neural network. Currently, CDA files from different hospitals are integrated in the cloud without strong security. The proposed approach uses AES encryption but replaces the standard key expansion with an artificial neural network trained to generate the encryption keys. This adds an extra layer of security since an adversary would not know the neural network topology used to generate the keys. The neural network is trained to match the output of standard AES key expansion and is analyzed to show it can produce the same encryption as conventional AES.
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt... (IRJET Journal)
This document discusses using Elasticsearch and OData services to analyze large log files from medical devices for remote health monitoring. OData allows RESTful APIs to access and consolidate data from different sources in JSON format. Elasticsearch indexes this data for fast, full-text searching. It can identify issues by searching log files and provide predictive solutions. Kibana visualizes indexed Elasticsearch data to facilitate analysis. The combination of these technologies efficiently manages large medical log data and identifies potential problems for prognostic solutions, improving health care services.
IRJET - Efficient Public Key Cryptosystem for Scalable Data Sharing in Cloud ... (IRJET Journal)
This document summarizes and evaluates several existing approaches for securely sharing data stored in the cloud. It discusses key aggregate cryptosystems that allow a user to generate a single aggregate key to decrypt a set of ciphertexts. It also reviews other techniques such as attribute-based encryption with proxy re-encryption, dynamic auditing services using random sampling and fragment structures, the Oruta system using ring signatures for public auditing of shared data, and privacy-preserving public auditing using message authentication codes. The document analyzes the advantages and disadvantages of each approach, such as increased key sizes with attribute-based encryption and the storage overhead of fragment structures.
Patient Privacy Control for Health Care in Cloud Computing System (IRJET Journal)
This document describes a cloud-based healthcare system that aims to provide secure sharing of patient health information between healthcare providers while maintaining patient privacy. The system uses authentication and access control schemes along with encryption techniques like attribute-based encryption to restrict access to patient data based on user attributes. It presents the design of the system architecture, which includes patient, doctor and lab units. It also outlines the main algorithms used in the system for key generation, signing data, verifying signatures, and simulating transcript data to protect patient identities. The goal of the system is to enable secure telemedicine and remote diagnosis capabilities while ensuring patient privacy in distributed cloud healthcare computing.
IRJET - Coarse Grain Load Balance Algorithm for Detecting (IRJET Journal)
This document proposes a new technique for securely querying encrypted DNA databases stored in the cloud. The key points are:
- DNA databases are sensitive personal information but could enable medical research if securely shared. Existing anonymization techniques are insufficient to protect privacy.
- The proposed technique builds on previous work but supports a richer set of queries while being faster. It favors storage over computation to optimize costs, since storage is cheaper than computation in cloud environments.
- The technique encrypts DNA data before outsourcing to the cloud, allowing aggregate queries to be run on the encrypted data while preserving individuals' privacy. This addresses privacy concerns with securely enabling medical analysis of genomic data in cloud databases.
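One way such aggregate queries over encrypted records can work is with a deterministic keyed encoding, so the cloud can count equal ciphertexts without seeing the plaintexts. This is a simplification of the paper's technique; the key, the genotype values, and the HMAC construction below are illustrative assumptions:

```python
import hashlib
import hmac

KEY = b"data-owner-secret"

def det_encrypt(value):
    """Deterministic keyed encoding: equal plaintexts map to equal
    tokens, so the cloud can group and count without learning the
    underlying values (simplified for illustration)."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# The owner outsources encoded genotype records to the cloud
records = [det_encrypt(g) for g in ["AA", "AG", "AA", "GG", "AG", "AA"]]

# Aggregate query: how many records carry genotype "AA"?
token = det_encrypt("AA")
print(sum(1 for r in records if r == token))  # 3
```

Deterministic encoding leaks equality patterns, which is part of why the actual scheme is more involved.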
Using metadata in filtered logs for prevention of database intrusion through ... (IAEME Publication)
This document summarizes a paper presented at the International Conference on Emerging Trends in Engineering and Management in 2014. The paper proposes a method to filter log data using frequent attribute values to reduce the data volume for forensic analysis. Metadata is then used to retrieve evidence of intrusions and notify the database owner. Logs are filtered using an algorithm that rates record redundancy, and metadata of filtered data is analyzed to detect intrusion patterns. If an intrusion is found, the database owner is alerted to reverse harmful actions. The framework involves data reduction using frequent attributes, metadata analysis for evidence, and intrusion notification.
- The document discusses designing a sensor-based experiment using the Brownie framework. It focuses on integrating heart rate biofeedback from sensors into the experiment.
- It provides steps for creating sensor configuration and recorder classes to initialize and record data from a Bioplux sensor, and code examples for configuring sampling rates and storing data files.
- The document aims to teach experimenters how to design experiments that measure and provide real-time biofeedback of physiological signals like heart rate to support research.
Efficient Similarity Search Over Encrypted Data (IRJET Journal)
1) The document discusses efficient similarity search over encrypted data stored in the cloud. It proposes using Locality Sensitive Hashing (LSH) to enable fast similarity searches of encrypted data without decrypting it first.
2) When a user uploads data, features are extracted and hashed using LSH to group similar documents into buckets. When performing a search, the user's query is hashed to identify matching buckets. Matches are identified by finding correlations between stored documents and the query.
3) The method allows similarity searches of encrypted cloud data efficiently by indexing and hashing documents during upload and generating query hashes to match documents during search, without decrypting the actual data. This addresses privacy and security issues of sensitive data stored in the cloud.
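The bucketing idea in steps 1-3 can be sketched with character n-grams and a minhash-style selection of bucket ids. The hash function, n-gram size, and number of bands below are arbitrary choices for illustration, not the paper's parameters:

```python
import hashlib

def ngrams(text, n=3):
    """Character n-grams used as document features."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def lsh_buckets(features, bands=4):
    """Map a feature set to a few bucket ids. Similar feature sets are
    likely to share a bucket (minhash-style selection of the smallest
    hashes, simplified for illustration)."""
    hashes = sorted(int(hashlib.md5(f.encode()).hexdigest(), 16)
                    for f in features)
    return set(hashes[:bands])

doc = lsh_buckets(ngrams("encrypted cloud storage"))
query = lsh_buckets(ngrams("encrypted cloud storage!"))
print(bool(doc & query))  # True: near-identical texts share buckets
```

At search time only bucket ids need to be compared, so the server never sees plaintext features.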
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan... (ijceronline)
A data estimation for failing nodes using fuzzy logic with integrated microco... (IJECEIAES)
Continuous data transmission in wireless sensor networks (WSNs) is one of the characteristics that makes sensors prone to failure. A backup strategy needs to co-exist with the network infrastructure to ensure that no data is missing. The proposed system relies on a backup strategy of building a history file that stores all data collected from the nodes. This file is later used by fuzzy logic to estimate missing data in case of failure. An easily programmable microcontroller unit equipped with a data storage mechanism serves as cost-effective storage media for these data. The estimation error is calculated continuously and used to update a reference "optimal table" that is used in estimating missing data. The error values also ensure that the system does not drift into an incrementally growing error state. This paper presents a system integrating an optimal data table, a microcontroller, and fuzzy logic to estimate the missing data of failing sensors. The adopted approach is guided by the minimum error calculated from previously collected data. Experimental findings show that the system has great potential to continue functioning with a failing node, with very low processing and storage requirements.
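The estimation-and-correction loop described above can be sketched minimally. A plain moving average stands in for the fuzzy inference stage, and the update rule for the reference table is an assumed form, not the paper's:

```python
def estimate_missing(history, window=3):
    """Estimate a failed node's next reading from its recent history.
    A moving average stands in for the fuzzy inference stage."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def update_optimal(optimal, actual, estimate, alpha=0.5):
    """Fold the observed estimation error back into the reference
    'optimal table' entry (assumed update rule, for illustration)."""
    return optimal + alpha * (actual - estimate)

history = [20.0, 21.0, 22.0]            # readings logged before failure
est = estimate_missing(history)
print(est)                              # 21.0
print(update_optimal(21.0, 22.0, est))  # 21.5
```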
Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction (IOSR Journals)
Abstract: Network security and Intrusion Detection Systems (IDSs) are an important security-related research area. This paper applies the K-star algorithm with filtering analysis in order to build a network intrusion detection system. For our experimental analysis, and as a case study, we have used the new NSL-KDD dataset, a modified version of the KDDCup 1999 intrusion detection benchmark dataset. With a split of 66.0% for the training set and the remainder for the testing set, a two-class classification has been implemented. WEKA, a Java-based open-source collection of machine learning algorithms for data mining tasks, has been used in the testing process. The experimental results show that the proposed approach is very accurate, with a low false positive rate and a high true positive rate, and it takes less learning time in comparison with other existing approaches for efficient network intrusion detection.
Keywords: Information Gain, Intrusion Detection System, Instance-based classifier, K-Star, Weka.
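The 66%/34% split mentioned in the abstract is straightforward to reproduce; a sketch, where the seed and record format are placeholders:

```python
import random

def train_test_split(dataset, train_frac=0.66, seed=1):
    """Shuffle and cut the dataset: 66% for training, the remainder
    for testing, mirroring the split used in the experiments."""
    data = dataset[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

records = list(range(100))    # placeholder for NSL-KDD records
train, test = train_test_split(records)
print(len(train), len(test))  # 66 34
```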
This document defines and solves the problem of effective and secure ranked keyword search over encrypted cloud data. Ranked search greatly enhances system usability by returning matching files in an order ranked by certain relevance criteria (e.g., keyword frequency), moving one step closer to practical deployment of privacy-preserving data-hosting services in cloud computing. To improve security for data retrieval from the cloud environment, a One Time Password is used: the One Time Password is sent to the user's email before the original data can be viewed. The model exhibits the querying process over the cloud computing infrastructure using secured and encrypted data access, and ranking the results benefits the user by returning better results.
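The ranking step can be illustrated with a plain keyword-frequency index. In the real scheme the per-file scores would be stored in encrypted form so the server learns only the resulting order; the index contents below are made up:

```python
def rank_results(index, keyword):
    """Return matching file ids ordered by keyword frequency, the
    relevance criterion mentioned above."""
    hits = [(fid, freqs[keyword]) for fid, freqs in index.items()
            if freqs.get(keyword, 0) > 0]
    return [fid for fid, _ in sorted(hits, key=lambda h: -h[1])]

# Hypothetical per-file keyword frequencies
index = {
    "f1": {"cloud": 5, "audit": 1},
    "f2": {"cloud": 2},
    "f3": {"audit": 7},
}
print(rank_results(index, "cloud"))  # ['f1', 'f2']
```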
IRJET- Netreconner: An Innovative Method to Intrusion Detection using Regular... (IRJET Journal)
This document describes NetReconner, an intrusion detection system that uses regular expressions to detect network attacks. It works by capturing network packets using tcpdump and storing them in a file. A detection engine then compares each line of the captured packets to a set of regular expressions that represent known attacks. If a match is found, an alert is generated. The system also allows administrators to add new regular expressions to detect newly discovered attacks. It was developed to provide continuous monitoring of the network to identify malicious traffic in real-time.
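The detection engine's core loop amounts to matching each captured line against a signature list. The regular expressions below are illustrative attack patterns, not NetReconner's actual rule set:

```python
import re

# Illustrative attack signatures (not NetReconner's actual rules)
SIGNATURES = [
    re.compile(r"(?i)union\s+select"),   # SQL injection probe
    re.compile(r"(?i)<script\b"),        # cross-site scripting attempt
    re.compile(r"\.\./\.\./"),           # directory traversal
]

def scan(packet_lines):
    """Compare each captured line against every signature and collect
    (line number, matched pattern) alerts."""
    return [(n, sig.pattern)
            for n, line in enumerate(packet_lines, 1)
            for sig in SIGNATURES if sig.search(line)]

capture = [
    "GET /index.html HTTP/1.1",
    "GET /items?id=1 UNION SELECT password FROM users",
]
print(scan(capture))  # one alert, for line 2
```

Adding a new detection rule is just appending another compiled pattern to `SIGNATURES`, which matches the system's described extensibility.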
Privacy Preserving Data Leak Detection for Sensitive Data (paperpublications3)
Abstract: The number of data leaks at organizations, research institutions, and security firms has grown rapidly in recent years. Data leakage occurs when there is no proper protection. The common approach is to monitor the data stored on the organization's local network. Existing methods require the sensitive data in plaintext. However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. A privacy-preserving data-leak detection (DLD) solution is proposed which can be outsourced and deployed in a semi-honest detection environment. A fuzzy fingerprint technique is designed and implemented that enhances data privacy during data-leak detection operations. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them. To prevent the DLD provider from gaining exact knowledge of the sensitive data, the collection of potential leaks is composed of real leaks and noise. The evaluation results show that this method provides accurate detection.
This document summarizes techniques for ensuring data integrity in cloud storage. It discusses Provable Data Possession (PDP) and Proof of Retrievability (PoR) as the two main schemes. PDP allows a client to check that a cloud server possesses their file correctly, while PoR guarantees file retrievability and addresses data corruption concerns using error correcting codes. The document also examines other methods like naive hashing, signature-based approaches, and their limitations regarding public auditing and dynamic operations. Overall, the document provides an overview of the key challenges and state-of-the-art solutions for verifying data integrity in cloud computing.
IRJET - Securing Computers from Remote Access Trojans using Deep Learning... (IRJET Journal)
This document presents a system that uses deep learning to detect Remote Access Trojans (RATs) with high accuracy. The proposed system has two operators - a host analyzer that monitors the host for irregularities, and a network analyzer that monitors network traffic for RAT patterns using an Artificial Neural Network algorithm. The system was tested on real datasets and achieved over 99.3% accuracy in detecting RAT files, with low false positive rates. Future work includes further reducing false positives and increasing the number of RAT samples to improve accuracy.
Using Learning Vector Quantization in IDS Alert Management System (CSCJournals)
This document presents a new intrusion detection system (IDS) alert management system that uses learning vector quantization (LVQ) to classify IDS alerts. The proposed system takes in alerts generated by Snort from the DARPA 98 dataset, normalizes and filters the alerts, then trains an LVQ neural network on labeled alert data. The trained LVQ model is used to classify new alerts as either true positives or false positives. The system is shown to achieve a high classification accuracy of 88.75% and a false positive reduction rate of 88.27%, while taking only 0.000018 seconds on average to classify each alert. This makes the system suitable for active alert management, where alerts need to be classified in real time.
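The LVQ1 update rule at the heart of such a classifier is compact: the winning prototype moves toward a sample with a matching label and away from a mismatched one. The prototypes, labels, and learning rate below are toy values, not the paper's configuration:

```python
def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(prototypes[j], x)))

def lvq_step(prototypes, proto_labels, x, y, lr=0.3):
    """LVQ1 update: pull the winning prototype toward the sample if
    the labels agree, push it away otherwise."""
    i = nearest(prototypes, x)
    sign = 1 if proto_labels[i] == y else -1
    prototypes[i] = [p + sign * lr * (a - p)
                     for a, p in zip(x, prototypes[i])]
    return prototypes

protos = [[0.0, 0.0], [10.0, 10.0]]                 # toy alert features
labels = ["false_positive", "true_positive"]
for x, y in [([1, 1], "false_positive"), ([9, 9], "true_positive")]:
    lvq_step(protos, labels, x, y)

print(labels[nearest(protos, [8.5, 9.0])])  # true_positive
```

Classification is a single nearest-prototype lookup, which is why per-alert classification times can be microseconds.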
Data Security In Relational Database Management System (CSCJournals)
Proving ownership rights over an outsourced relational database is a crucial issue in today's internet-based application environments and in many content distribution applications. A mechanism is proposed here for proof of ownership based on the secure embedding of a robust, imperceptible watermark in relational data. Watermarking of relational databases is formulated as a constrained optimization problem, and efficient techniques are discussed for solving the optimization problem and handling the constraints. This watermarking technique is resilient to watermark synchronization errors because it uses a partitioning approach that does not require marker tuples, overcoming a major weakness in previously proposed watermarking techniques. Watermark decoding is based on a threshold technique characterized by an optimal threshold that minimizes the probability of decoding errors. A proof-of-concept implementation of the watermarking technique was built, and experimental results showed that the technique is resilient to tuple deletion, alteration, and insertion attacks.
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –... (IRJET Journal)
This document summarizes a research paper that proposes a methodology for optimizing storage on the cloud using authorized de-duplication. It discusses how de-duplication works to eliminate duplicate data and optimize storage. The key steps are chunking files into blocks, applying secure hash algorithms like SHA-512 to generate unique hashes for each block, and comparing hashes to reference duplicate blocks instead of storing multiple copies. It also discusses using cryptographic techniques like ciphertext-policy attribute-based encryption for authentication and security on public clouds. The proposed approach aims to optimize storage while providing authorized de-duplication functionality.
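The chunk-hash-reference pipeline can be sketched directly with hashlib. The 4-byte chunk size is unrealistically small and purely for illustration; real systems chunk at kilobyte granularity:

```python
import hashlib

CHUNK = 4  # bytes per block (tiny, for illustration only)

def dedup_store(data, store=None):
    """Chunk a file, hash each block with SHA-512, and keep only one
    copy per unique hash; the file becomes a list of hash references."""
    store = {} if store is None else store
    refs = []
    for i in range(0, len(data), CHUNK):
        block = data[i:i + CHUNK]
        h = hashlib.sha512(block).hexdigest()
        store.setdefault(h, block)   # physical copy stored only once
        refs.append(h)
    return refs, store

refs, store = dedup_store(b"ABCDABCDABCDXYZ!")
print(len(refs), len(store))  # 4 references, 2 unique blocks
```

Reconstruction walks the reference list and concatenates the stored blocks, so duplicate blocks cost one copy plus a hash per reference.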
Data Hiding In Medical Images by Preserving Integrity of ROI Using Semi-Rever... (IJERA Editor)
Text fusion in images is an important technology for image processing. Patient reports contain large amounts of important information that requires considerable storage, along with the proper position and name linking each image to its data. In this work we find the ROI (region of interest) of a given image and fuse the related document into the NROI (non-region of interest) of the image. Many techniques exist for fusing text data into medical images; one of them embeds the data at the borders of the image in a particular, pre-defined border space. We propose an algorithm that first finds the region of interest and then finds noisy pixels of the image, embedding data in those noisy portions. We use wavelets for smoothing images and a segmentation process for extracting the region of interest. The coordinates of the noisy pixels are located and data is embedded in those pixels. The embedding technique writes data into the least significant bits and hence does not degrade the quality of the image beyond acceptable limits. Results show good PSNR and MSE values, which are used for measuring quality performance.
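The LSB embedding step can be sketched on a flat pixel array. Which positions count as noisy, non-ROI pixels is assumed here; in the paper those come from the wavelet and segmentation stage:

```python
def embed_bits(pixels, positions, bits):
    """Write payload bits into the least significant bit of the chosen
    (noisy, non-ROI) pixel positions, leaving other pixels untouched."""
    out = list(pixels)
    for pos, bit in zip(positions, bits):
        out[pos] = (out[pos] & ~1) | bit
    return out

def extract_bits(pixels, positions):
    """Read the payload back from the same positions."""
    return [pixels[pos] & 1 for pos in positions]

pixels = [120, 121, 130, 200, 50]   # toy grayscale values
noisy = [1, 3, 4]                   # assumed noisy / non-ROI positions
payload = [1, 0, 1]
stego = embed_bits(pixels, noisy, payload)
print(extract_bits(stego, noisy))   # [1, 0, 1]
```

Each embedded bit changes a pixel value by at most 1, which is why LSB embedding barely affects PSNR.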
Efficient Similarity Search over Encrypted Data (IRJET Journal)
This document summarizes a research paper that proposes an efficient method for similarity search over encrypted data stored in the cloud. The method uses Locality Sensitive Hashing (LSH) to index the encrypted data and generate "trapdoors" or encodings of search queries. When a user submits a query, the trapdoor is generated and similarity search is performed by finding the similarity between the query trapdoor and the encrypted data indexes stored in the cloud, without decrypting the actual data. The paper outlines the data uploading and query processing steps, which include preprocessing, encryption, trapdoor generation using LSH-based bucketing of n-grams from the query terms, and using Bloom filters for efficient similarity matching during search.
This document summarizes various papers on developing intrusion detection systems using neural networks. It discusses different algorithms researchers have used to train neural networks for intrusion detection, including feed-forward neural networks, self-organizing maps, test driven development neural networks, combinations of supervised and unsupervised learning techniques, differential evolution, and backpropagation neural networks. Each algorithm has advantages and disadvantages. The document concludes that neural networks provide a flexible approach to intrusion detection and can learn new intrusion patterns, and proposes developing an additional level of protection using self-organizing maps to better detect intrusions.
Design and implementation of secured scan based attacks on ic’s by using on c... (eSAT Publishing House)
The document proposes a scan-protection scheme that provides testing facilities both during chip assembly and over the course of the circuit's life. It compares expected and actual test responses inside the circuit using an efficient principle to scan-in both input vectors and expected responses. This avoids needing to disconnect scan chains after manufacturing testing for security purposes. The proposed approach compares test responses within the chip at the vector level rather than providing the value of each individual scan bit. This ensures security while not relying on costly external test infrastructure. Simulation results show the proposed scheme occupies less area on the chip than existing approaches.
Twelve different machine learning models are built for disease prediction using secure machine learning. Homomorphic encryption is used for privacy preservation, allowing the models to compute on encrypted data.
Efficiently Detecting and Analyzing Spam Reviews Using Live Data Feed (IRJET Journal)
This document proposes a system for efficiently detecting and analyzing spam reviews using a live data feed. The system aims to evaluate genuine customer feedback to help business analysts make decisions. It involves acquiring data from various sources, processing the data in parallel to detect fake reviews, and analyzing the results to identify spam. The key aspects of the system include filtering the data, load balancing among processing servers, aggregating results, and making decisions based on the analysis. The system architecture is divided into three units - data acquisition, data processing, and data analysis and decision making. Various algorithms are used for filtration, load balancing, processing, normalization, and summarization. The system provides accurate identification of spam while extracting useful customer feedback.
This document presents a machine learning approach to classify URLs as benign or malicious using only URL features. It describes current blacklisting approaches and their limitations. The proposed system extracts over 30,000 features from URLs, including information from blacklists, domain registration details, host properties, and lexical features. Logistic regression with L1 regularization is used to classify URLs with over 86.5% accuracy using a diverse set of over 18,000 features. The approach aims to detect malicious URLs that may evade blacklisting with high accuracy using only URL information.
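A few of the lexical features such a classifier consumes can be computed directly from the URL string. This is a tiny illustrative subset of the tens of thousands of features the paper describes:

```python
import re
from urllib.parse import urlparse

def lexical_features(url):
    """A handful of lexical URL features (illustrative subset; the
    full system also uses blacklist, WHOIS, and host-based features)."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_ip_host": bool(re.fullmatch(r"[\d.]+", host)),
        "num_digits": sum(c.isdigit() for c in url),
        "path_depth": parsed.path.count("/"),
    }

feats = lexical_features("http://192.168.0.1/login/update.php")
print(feats["has_ip_host"], feats["num_dots_in_host"])  # True 3
```

Each feature becomes one dimension of the sparse vector fed to the L1-regularized logistic regression.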
Data processing in Industrial Systems course notes after week 5 (Ufuk Cebeci)
This document discusses database management systems and decision support systems. It begins by outlining some of the challenges with traditional information processing approaches, such as data redundancy and lack of flexibility. It then introduces database management systems as a solution, highlighting their ability to reduce redundancy and integrate related data. Key features of DBMS like logical data structures and relational models are explained. The document also covers decision support systems, noting that they provide interactive support during decision making by using analytical models, specialized databases, and the insights of decision makers. Major components of DSS like model bases are outlined.
This document discusses using machine learning to detect malicious clients. It covers how malware relies on command and control infrastructure and DNS to operate and evade detection. It then presents an architecture that uses big data analytics on network data like DNS queries and WHOIS records to detect malware without relying on endpoint information. Models are developed to detect randomly generated domains, malicious DNS resolvers, and anomalous device behavior. These models analyze features of domains, DNS queries and WHOIS records. The document shows the models achieve high accuracy, precision, recall and AUC in detecting different malware families.
- The document discusses designing a sensor-based experiment using the Brownie framework. It focuses on integrating heart rate biofeedback from sensors into the experiment.
- It provides steps for creating sensor configuration and recorder classes to initialize and record data from a Bioplux sensor, and code examples for configuring sampling rates and storing data files.
- The document aims to teach experimenters how to design experiments that measure and provide real-time biofeedback of physiological signals like heart rate to support research.
Efficient Similarity Search Over Encrypted DataIRJET Journal
1) The document discusses efficient similarity search over encrypted data stored in the cloud. It proposes using Locality Sensitive Hashing (LSH) to enable fast similarity searches of encrypted data without decrypting it first.
2) When a user uploads data, features are extracted and hashed using LSH to group similar documents into buckets. When performing a search, the user's query is hashed to identify matching buckets. Matches are identified by finding correlations between stored documents and the query.
3) The method allows efficient similarity searches of encrypted cloud data by indexing and hashing documents during upload and generating query hashes to match documents during search, without decrypting the actual data. This addresses privacy and security issues of sensitive data stored in the cloud.
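The bucketing step can be sketched with min-hash-style LSH over character n-grams. Everything below (hash seeds, bucket count, helper names) is illustrative, not the paper's actual scheme:

```python
import hashlib

def ngrams(text: str, n: int = 3) -> set:
    """Character n-grams, the features extracted from a document or query."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(grams: set, seed: int) -> int:
    """Smallest hash over the gram set; equal for two sets with probability
    equal to their Jaccard similarity."""
    return min(int(hashlib.sha256(f"{seed}:{g}".encode()).hexdigest(), 16)
               for g in grams)

def lsh_buckets(text: str, bands: int = 4) -> list:
    """One min-hash per band; each value is a bucket id the document is filed under."""
    grams = ngrams(text)
    return [minhash(grams, seed) % 10_000 for seed in range(bands)]

doc = "confidential quarterly revenue report"
query = "quarterly revenue report"
# Similar texts are likely to collide in at least one bucket.
shared = set(lsh_buckets(doc)) & set(lsh_buckets(query))
```

In the encrypted setting, only the bucket ids (not the plaintext grams) would be revealed to the server.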
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan... - ijceronline
A data estimation for failing nodes using fuzzy logic with integrated microco... - IJECEIAES
Continuous data transmission in wireless sensor networks (WSNs) is one of the characteristics that makes sensors prone to failure. A backup strategy needs to coexist with the network infrastructure to ensure that no data is missing. The proposed system relies on a backup strategy of building a history file that stores all data collected from the nodes. This file is later used by fuzzy logic to estimate missing data in case of failure. An easily programmable microcontroller unit equipped with a data storage mechanism serves as cost-effective storage media for these data. An estimation error is calculated continuously and used to update a reference "optimal table" that drives the estimation of missing data. The error values also ensure that the system does not drift into an incremental error state. This paper presents a system integrating an optimal data table, a microcontroller, and fuzzy logic to estimate the missing data of failing sensors. The adopted approach is guided by the minimum error calculated from previously collected data. Experimental findings show that the system has great potential to continue functioning with a failing node, with very low processing and storage requirements.
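The estimation loop can be illustrated with a deliberately simplified stand-in: a moving average over the node's history file plus an error-driven update of the reference value, in place of the paper's fuzzy inference. All names and constants here are invented for illustration:

```python
def estimate_missing(history: list, k: int = 3) -> float:
    """Estimate a failed node's reading from its k most recent samples.
    (A simplified stand-in for the paper's fuzzy-logic inference.)"""
    recent = history[-k:]
    return sum(recent) / len(recent)

def update_reference(optimal: float, estimate: float, actual: float,
                     alpha: float = 0.5) -> float:
    """Fold the observed estimation error back into the reference 'optimal table'
    so that future estimates self-correct instead of accumulating error."""
    return optimal + alpha * (actual - estimate)

history = [21.0, 21.5, 22.0, 22.4]    # readings logged for one sensor node
estimate = estimate_missing(history)  # used when the node fails to report
```

The key property mirrored here is the feedback loop: each time a real reading arrives, the error against the last estimate nudges the reference value, keeping the system out of an incremental-error state.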
Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction - IOSR Journals
Abstract: Network security and Intrusion Detection Systems (IDSs) are an important security-related research area. This paper applies the K-star algorithm with filtering analysis to build a network intrusion detection system. For our experimental analysis, and as a case study, we used the new NSL-KDD dataset, a modified version of the KDDCup 1999 intrusion detection benchmark dataset. With a split of 66.0% for the training set and the remainder for the testing set, a two-class classification was implemented. WEKA, a Java-based open-source software suite comprising a collection of machine learning algorithms for data mining tasks, was used in the testing process. The experimental results show that the proposed approach is very accurate, with a low false positive rate and a high true positive rate, and takes less learning time than other existing approaches for efficient network intrusion detection.
Keywords: Information Gain, Intrusion Detection System, Instance-based classifier, K-Star, Weka.
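The Information Gain filtering named in the keywords ranks features by how much they reduce label entropy, so low-value features can be dropped before training. A sketch on invented toy connection records (the feature names and values are not from the NSL-KDD dataset):

```python
import math
from collections import Counter

def entropy(labels: list) -> float:
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows: list, labels: list, feature: str) -> float:
    """Entropy reduction achieved by splitting the records on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy records: the "flag" feature perfectly separates the classes here.
rows = [{"proto": "tcp", "flag": "SF"}, {"proto": "udp", "flag": "SF"},
        {"proto": "tcp", "flag": "REJ"}, {"proto": "tcp", "flag": "REJ"}]
labels = ["normal", "normal", "attack", "attack"]
ranked = sorted(["proto", "flag"],
                key=lambda f: information_gain(rows, labels, f), reverse=True)
```

Feature-set reduction then keeps only the top of `ranked` before handing the data to the classifier.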
Define and solve the problem of effective and secure ranked keyword search over encrypted cloud data. Ranked search greatly enhances system usability by returning matching files in an order ranked by certain relevance criteria (e.g., keyword frequency), thus moving one step closer to practical deployment of privacy-preserving data hosting services in cloud computing. To improve security for data retrieval from the cloud environment, a One-Time Password (OTP) is used: the OTP is sent to the user's email before the original data can be viewed. The model exhibits the querying process over the cloud computing infrastructure using secure, encrypted data access, and ranking over the results helps the user obtain better results.
IRJET- Netreconner: An Innovative Method to Intrusion Detection using Regular... - IRJET Journal
This document describes NetReconner, an intrusion detection system that uses regular expressions to detect network attacks. It works by capturing network packets using tcpdump and storing them in a file. A detection engine then compares each line of the captured packets to a set of regular expressions that represent known attacks. If a match is found, an alert is generated. The system also allows administrators to add new regular expressions to detect newly discovered attacks. It was developed to provide continuous monitoring of the network to identify malicious traffic in real-time.
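The detection engine's line-by-line matching can be sketched as below. The signature patterns are hypothetical examples, not NetReconner's actual rule set:

```python
import re

# Hypothetical attack signatures; a real deployment would load these from an
# administrator-maintained rules file so new attacks can be added.
SIGNATURES = {
    "sql_injection": re.compile(r"(?i)union\s+select|'\s*or\s+1=1"),
    "path_traversal": re.compile(r"\.\./\.\./"),
}

def scan_line(line: str) -> list:
    """Return the names of every signature that matches one captured line."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(line)]

def scan_capture(lines: list) -> list:
    """Walk a capture file and emit (line number, signature) alerts."""
    alerts = []
    for lineno, line in enumerate(lines, 1):
        for name in scan_line(line):
            alerts.append((lineno, name))
    return alerts

capture = ["GET /index.html HTTP/1.1",
           "GET /page?id=1' OR 1=1-- HTTP/1.1",
           "GET /../../etc/passwd HTTP/1.1"]
alerts = scan_capture(capture)
```

In the described system the `capture` list would come from a tcpdump output file rather than an in-memory list.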
Privacy Preserving Data Leak Detection for Sensitive Data - paperpublications3
Abstract: Number of data leaks in the organization, research institutions and security firms have grown rapidly in recent years. The data leakage occurs if there is no proper protection. The common approach is to monitor the data that are stored in the organization local network. The existing method require the plaintext sensitive data. However, this requirement is undesirable, as it may threaten the confidentiality of the sensitive information. A privacy preserving data-leak detection solution is proposed which can be outsourced and be deployed in a semi-honest detection environment. Fuzzy fingerprint technique is designed and implemented that enhances data privacy during data-leak detection operations. The DLD provider computes fingerprints from network traffic and identifies potential leaks in them. To prevent the DLD provider from gathering exact knowledge about the sensitive data, the collection of potential leaks is composed of real leaks and noises. The evaluation results show that this method can provide accurate detection.
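The fuzzy fingerprint idea, hashing shingles of the sensitive data and discarding low-order bits so the DLD provider never learns exact fingerprints, can be sketched as follows. The shingle size, mask width, and hash choice are illustrative, not the paper's parameters:

```python
import hashlib

MASK_BITS = 8   # low-order bits discarded: one fingerprint covers a "fuzzy" neighbourhood

def fingerprints(data: bytes, n: int = 8) -> set:
    """Fuzzy fingerprints over every n-byte shingle of the input."""
    out = set()
    for i in range(len(data) - n + 1):
        digest = int(hashlib.sha1(data[i:i + n]).hexdigest(), 16)
        out.add(digest >> MASK_BITS)   # masking hides exact values from the provider
    return out

def detect_leak(sensitive: bytes, traffic: bytes, threshold: int = 1) -> bool:
    """Flag traffic whose fuzzy fingerprints collide with the sensitive set."""
    return len(fingerprints(sensitive) & fingerprints(traffic)) >= threshold

secret = b"customer SSN 123-45-6789 on file"
```

Because the mask widens each fingerprint's coverage, real leaks and benign noise both produce candidate hits, which is exactly what keeps the provider from learning the sensitive data precisely.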
This document summarizes techniques for ensuring data integrity in cloud storage. It discusses Provable Data Possession (PDP) and Proof of Retrievability (PoR) as the two main schemes. PDP allows a client to check that a cloud server possesses their file correctly, while PoR guarantees file retrievability and addresses data corruption concerns using error correcting codes. The document also examines other methods like naive hashing, signature-based approaches, and their limitations regarding public auditing and dynamic operations. Overall, the document provides an overview of the key challenges and state-of-the-art solutions for verifying data integrity in cloud computing.
IRJET - Securing Computers from Remote Access Trojans using Deep Learning... - IRJET Journal
This document presents a system that uses deep learning to detect Remote Access Trojans (RATs) with high accuracy. The proposed system has two operators - a host analyzer that monitors the host for irregularities, and a network analyzer that monitors network traffic for RAT patterns using an Artificial Neural Network algorithm. The system was tested on real datasets and achieved over 99.3% accuracy in detecting RAT files, with low false positive rates. Future work includes further reducing false positives and increasing the number of RAT samples to improve accuracy.
Using Learning Vector Quantization in IDS Alert Management System - CSCJournals
This document presents a new intrusion detection system (IDS) alert management system that uses learning vector quantization (LVQ) to classify IDS alerts. The proposed system takes in alerts generated by Snort from the DARPA 98 dataset, normalizes and filters the alerts, then trains an LVQ neural network on labeled alert data. The trained LVQ model is used to classify new alerts as either true positives or false positives. The system is shown to achieve a high classification accuracy of 88.75% and a false positive reduction rate of 88.27%, while taking only 0.000018 seconds on average to classify each alert. This makes the system suitable for active alert management where alerts need to be classified in real time.
Data Security In Relational Database Management System - CSCJournals
Proving ownership rights over outsourced relational databases is a crucial issue in today's internet-based application environments and in many content distribution applications. A mechanism is proposed here for proof of ownership based on the secure embedding of a robust, imperceptible watermark in relational data. Watermarking of relational databases is formulated as a constrained optimization problem, and efficient techniques are discussed for solving the optimization problem and handling the constraints. This watermarking technique is resilient to watermark synchronization errors because it uses a partitioning approach that does not require marker tuples. The approach overcomes a major weakness in previously proposed watermarking techniques. Watermark decoding is based on a threshold-based technique characterized by an optimal threshold that minimizes the probability of decoding errors. A proof-of-concept implementation of the watermarking technique was built, and experimental results show that the technique is resilient to tuple deletion, alteration, and insertion attacks.
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –... - IRJET Journal
This document summarizes a research paper that proposes a methodology for optimizing storage on the cloud using authorized de-duplication. It discusses how de-duplication works to eliminate duplicate data and optimize storage. The key steps are chunking files into blocks, applying secure hash algorithms like SHA-512 to generate unique hashes for each block, and comparing hashes to reference duplicate blocks instead of storing multiple copies. It also discusses using cryptographic techniques like ciphertext-policy attribute-based encryption for authentication and security on public clouds. The proposed approach aims to optimize storage while providing authorized de-duplication functionality.
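The chunk-hash-reference pipeline described can be sketched in a few lines. The tiny chunk size is only for illustration; real systems use KB-scale or content-defined chunks:

```python
import hashlib

CHUNK_SIZE = 4   # bytes, illustrative; production systems chunk far larger blocks

def store(data: bytes, block_store: dict) -> list:
    """Split data into blocks, hash each with SHA-512, keep one copy per unique
    block, and return the hash list that references the whole file."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        block = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha512(block).hexdigest()
        block_store.setdefault(digest, block)   # duplicate blocks are stored once
        refs.append(digest)
    return refs

storage = {}
refs = store(b"AAAABBBBAAAA", storage)   # blocks: AAAA, BBBB, AAAA
```

The file needs three references but only two stored blocks: the repeated `AAAA` block is de-duplicated. The encryption and attribute-based authorization layers from the paper would sit on top of this core.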
Data Hiding In Medical Images by Preserving Integrity of ROI Using Semi-Rever... - IJERA Editor
Text fusion in images is an important technology for image processing. We have a great deal of important information related to patients' reports, which needs considerable space to store, along with the proper position and name that relate an image to its data. In our work we find the ROI (region of interest) of a given image and fuse the related document into the NROI (non-region of interest) of the image. Many existing techniques fuse text data into medical images; one of them fuses data at the borders of the image within a particular, pre-defined border space. We propose an algorithm in which we first find the region of interest and then find the noisy pixels of the image, embedding data in those noisy portions. We use wavelets for smoothing images and a segmentation process for extracting the region of interest. The coordinates of the noisy pixels are located and data is embedded in those pixels. The embedding technique used embeds data in the least significant bits, and hence does not degrade the quality of the image beyond acceptable limits. Results show good PSNR and MSE values, which are used for measuring quality performance.
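The least-significant-bit embedding itself is simple to illustrate. A sketch, assuming the noisy-pixel coordinates have already been located (the pixel values are invented):

```python
def embed_bits(pixels: list, bits: list) -> list:
    """Overwrite the least-significant bit of each target pixel with one payload bit."""
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit   # clear LSB, then set it to the payload bit
    return out

def extract_bits(pixels: list, count: int) -> list:
    """Recover the payload by reading the LSB of each pixel."""
    return [p & 1 for p in pixels[:count]]

noisy_region = [200, 113, 57, 88, 249, 16]   # grey values at the located noisy pixels
payload = [1, 0, 1, 1]
stego = embed_bits(noisy_region, payload)
```

Each pixel changes by at most 1 grey level, which is why LSB embedding keeps PSNR high and the degradation within acceptable limits.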
Efficient Similarity Search over Encrypted Data - IRJET Journal
This document summarizes a research paper that proposes an efficient method for similarity search over encrypted data stored in the cloud. The method uses Locality Sensitive Hashing (LSH) to index the encrypted data and generate "trapdoors" or encodings of search queries. When a user submits a query, the trapdoor is generated and similarity search is performed by finding the similarity between the query trapdoor and the encrypted data indexes stored in the cloud, without decrypting the actual data. The paper outlines the data uploading and query processing steps, which include preprocessing, encryption, trapdoor generation using LSH-based bucketing of n-grams from the query terms, and using Bloom filters for efficient similarity matching during search.
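A minimal Bloom filter like the one used for the efficient similarity matching step might look like this; the filter size, hash count, and indexed n-grams are illustrative, not the paper's parameters:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for fast set-membership tests over n-grams."""

    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes = size, hashes
        self.bits = [0] * size

    def _positions(self, item: str):
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for gram in ["enc", "ncr", "cry", "ryp"]:   # n-grams of an indexed term
    bf.add(gram)
```

During search, the trapdoor's n-grams are tested against per-document filters, so candidate matches are found without touching the encrypted payloads.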
This document summarizes various papers on developing intrusion detection systems using neural networks. It discusses different algorithms researchers have used to train neural networks for intrusion detection, including feed-forward neural networks, self-organizing maps, test driven development neural networks, combinations of supervised and unsupervised learning techniques, differential evolution, and backpropagation neural networks. Each algorithm has advantages and disadvantages. The document concludes that neural networks provide a flexible approach to intrusion detection and can learn new intrusion patterns, and proposes developing an additional level of protection using self-organizing maps to better detect intrusions.
Design and implementation of secured scan based attacks on ic’s by using on c... - eSAT Publishing House
The document proposes a scan-protection scheme that provides testing facilities both during chip assembly and over the course of the circuit's life. It compares expected and actual test responses inside the circuit using an efficient principle to scan-in both input vectors and expected responses. This avoids needing to disconnect scan chains after manufacturing testing for security purposes. The proposed approach compares test responses within the chip at the vector level rather than providing the value of each individual scan bit. This ensures security while not relying on costly external test infrastructure. Simulation results show the proposed scheme occupies less area on the chip than existing approaches.
Twelve different machine learning models are built for disease prediction using safe machine learning. Homomorphic encryption is used to preserve privacy.
Efficiently Detecting and Analyzing Spam Reviews Using Live Data Feed - IRJET Journal
This document proposes a system for efficiently detecting and analyzing spam reviews using a live data feed. The system aims to evaluate genuine customer feedback to help business analysts make decisions. It involves acquiring data from various sources, processing the data in parallel to detect fake reviews, and analyzing the results to identify spam. The key aspects of the system include filtering the data, load balancing among processing servers, aggregating results, and making decisions based on the analysis. The system architecture is divided into three units - data acquisition, data processing, and data analysis and decision making. Various algorithms are used for filtration, load balancing, processing, normalization, and summarization. The system provides accurate identification of spam while extracting useful customer feedback.
This document presents a machine learning approach to classify URLs as benign or malicious using only URL features. It describes current blacklisting approaches and their limitations. The proposed system extracts over 30,000 features from URLs, including information from blacklists, domain registration details, host properties, and lexical features. Logistic regression with L1 regularization is used to classify URLs with over 86.5% accuracy using a diverse set of over 18,000 features. The approach aims to detect malicious URLs that may evade blacklisting with high accuracy using only URL information.
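A small subset of the lexical features such a classifier might consume can be sketched as below; the paper's actual feature set is far larger and also draws on blacklists, WHOIS, and host properties:

```python
import re
from urllib.parse import urlparse

def lexical_features(url: str) -> dict:
    """A few illustrative lexical features extracted from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        # Raw-IP hosts are a classic phishing signal.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "num_digits": sum(ch.isdigit() for ch in url),
        "path_depth": parsed.path.count("/"),
    }

feats = lexical_features("http://192.168.0.7/login/update/verify.php")
```

Each feature becomes one column in the sparse vector fed to the L1-regularized logistic regression, which then zeroes out the uninformative ones.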
Data processing in Industrial Systems course notes after week 5 - Ufuk Cebeci
Predictive Security in the 3rd Platform Era - IDC Italy
Presentation by Giancarlo Vercellino, Research & Consulting Manager at IDC Italia, given at the IDC conference "Predictive Security in the 3rd Platform Era" in Milan, 11 March 2015.
Artificial intelligence in information security - pradnya patil
Artificial intelligence concepts like neural networks, expert systems, intelligent agents, search and learning, and constraint solving were presented. Specific applications of neural networks and intelligent agents were discussed. The presentation was given by Pallavi Ghatage and Pradnya Patil on topics relating to artificial intelligence and information security.
Narus provides cybersecurity analytics and solutions to help customers gain visibility into their network traffic and security threats. Their technology fuses network, semantic, and user data to provide comprehensive security insights. Key challenges include increasing data volumes and diversity of network deployments. Narus addresses these with an integrated analytics platform that uses machine learning to extract metadata and detect anomalies in real-time and over long periods of stored data. Their hybrid approach leverages both Hadoop/Hbase and relational databases for scalable analytics and business intelligence.
Using Machine Learning in Networks Intrusion Detection Systems - Omar Shaya
The internet and computing devices, from desktop computers to smartphones, have raised many security and privacy concerns, and the need has emerged to automate systems that detect attacks on these networks so they can be protected at scale. While traditional intrusion detection methods may detect previously known attacks, the problem of dealing with new, unknown attacks remains, which makes machine learning a strong candidate for solving these challenges.
In this report, we investigate the use of machine learning in detecting network attacks (intrusion detection) by reviewing work that has been done in this field, particularly the work of Pasocal et al.
Jisheng Wang at AI Frontiers: Deep Learning in Security - AI Frontiers
Deep learning is the next wave of AI-based attack detection. We will share our customer-driven experiences and learnings from building a comprehensive User and Entity Behavior Analytics (UEBA) solution using Apache Spark and Google Tensorflow to detect multi-stage advanced attacks. We will also discuss the challenges and guidelines for successfully deploying deep learning in broader security.
This document discusses user behavioral analytics and machine learning for threat detection. It summarizes that legacy security information and event management (SIEM) technologies are not adequate for detecting insider threats and advanced adversaries. It then describes how user behavioral analytics uses machine learning to develop multi-entity behavioral models across users, applications, hosts, and networks to detect anomalous behavior indicative of insider threats or advanced cyberattacks. Contact information is provided for the security consultant presenting on this topic.
AWS re:Invent 2016: Predictive Security: Using Big Data to Fortify Your Defen... - Amazon Web Services
In a rapidly changing IT environment, detecting and responding to new threats is more important than ever. This session shows you how to build a predictive analytics stack on AWS, which harnesses the power of Amazon Machine Learning in conjunction with Amazon Elasticsearch Service, AWS CloudTrail, and VPC Flow Logs to perform tasks such as anomaly detection and log analysis. We also demonstrate how you can use AWS Lambda to act on this information in an automated fashion, such as performing updates to AWS WAF and security groups, leading to an improved security posture and alleviating operational burden on your security teams.
Parallel and Distributed Algorithms for Large Text Datasets Analysis - Illia Ovchynnikov
This document is a thesis submitted for the degree of Bachelor of Computer Science at Opole University of Technology. It explores using distributed systems for processing large text datasets in the context of near duplicate text detection. The study reviews big data concepts, popular analytics frameworks like Hadoop and Spark, and algorithms for determining document duplication levels. The results were applied to develop a prototype distributed anti-plagiarism system that showed improved performance over existing solutions for analyzing large collections of text data.
This document compares the performance of classification algorithms like decision tree, naive Bayes, and k-nearest neighbors using five data mining tools: RapidMiner, WEKA, Tanagra, Orange, and Knime. The algorithms are tested on an Indian liver patient dataset containing 416 liver patient and 167 non-liver patient records. Accuracy scores are reported from the confusion matrices generated by each tool. Overall, Knime achieved the highest accuracy for all three algorithms, with decision tree and k-nearest neighbor performing better than naive Bayes. WEKA had the lowest naive Bayes accuracy.
The document describes several applications that will be demonstrated at the NIH IC Applications Show & Tell Program. It includes summaries of 10 applications, providing details on their functionality, users, and contact information. The applications cover a range of areas including portfolio analysis, library resources, low-cost displays, and research exchange platforms.
This document provides a summary of Abby Brown's technical experience including book reviews and editor roles, software patents submitted and issued, research projects and presentations conducted, and technical memberships. Key details include serving as technical editor for two books on Tibco software, submitting several patents around automated tools for services, metadata, and network alarms, presenting on topics such as cloud computing and SOA, and holding memberships in technical organizations like IEEE, The Open Group, and OASIS.
This document discusses using machine learning to classify malware into families based on the DREBIN dataset. It covers:
1. Preprocessing the dataset, including integer encoding and one-hot encoding to convert categorical data to numeric form for modeling.
2. Addressing overfitting by splitting the data into training and test sets and using cross-validation.
3. Using classifiers like Random Forest and SVM with strategies like one-vs-all and one-vs-one to perform multiclass classification of malware families.
4. The process of using binary classifiers for each family first, then combining the results to classify malware into the appropriate family.
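The encoding step in point 1 can be sketched as follows; the family names are a small invented sample, not the full DREBIN label set:

```python
def integer_encode(values: list):
    """Map each category to a stable integer index (sorted for determinism)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot(values: list) -> list:
    """One binary column per category, so no artificial ordering is implied
    by the integer codes."""
    encoded, mapping = integer_encode(values)
    width = len(mapping)
    return [[1 if i == idx else 0 for i in range(width)] for idx in encoded]

families = ["FakeInstaller", "DroidKungFu", "Plankton", "DroidKungFu"]
matrix = one_hot(families)
```

The one-hot matrix is what the Random Forest or SVM consumes; the one-vs-all strategy then trains one binary classifier per column.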
This document appears to be a thesis submitted by Akash Rajguru for the award of a Bachelor of Engineering (Honours) in Software Engineering at Athlone Institute of Technology. The thesis investigates developing an intrusion detection system with honeypot integration using Java. It will focus on researching concepts of intrusion detection, prevention and honeypots. It will also explore Java libraries to develop a desktop application to help network administrators monitor network traffic and packet flow. The application will allow packet capturing, port scanning, blocking ports, and storing captured data locally and remotely in MongoDB. It will also integrate two honeypot servers to capture hacker information.
1) The document proposes developing a web-based course enrollment system using PHP, MySQL, JavaScript, HTML, and CSS.
2) It will allow students to enroll in courses online and provide reports to staff.
3) The system will be tested at the database level and interface level before full implementation. Maintenance of the system will be conducted regularly to ensure functionality.
Data Gaurd Final Thesis for University in Progress (2).docx - MohdKashif82
The document is a project report submitted for a master's degree in computer applications. It discusses implementing Oracle Data Guard. The report includes sections on Oracle architecture, Data Guard architecture, installing Oracle Linux and Oracle 19c, configuring a database and standby, and output from the Data Guard configuration. The student declares the work as their own and acknowledges their guide and university faculty for their support and guidance.
This document discusses distributed tracing and OpenTelemetry. It provides an overview of tracing concepts like spans and context propagation. It describes the OpenTelemetry architecture including specifications, instrumentation libraries, and the OpenTelemetry collector. It discusses how to instrument applications for automatic tracing and exporting telemetry data. Finally, it covers best practices for debugging distributed systems using observability data and next steps to get involved in the OpenTelemetry community.
The document describes using a VGG model for image classification of Venice boat types from the MarDCT dataset. It discusses:
1. Using the VGG16 and VGG19 pre-trained models from Keras to extract features from images in the MarDCT training and test sets.
2. Training linear SVM and Random Forest classifiers on the extracted features to classify images into 24 boat types.
3. Evaluating the classifiers using techniques like k-fold cross-validation, and calculating accuracy, precision, recall, and F1 scores.
WSO2 Machine Learner takes data one step further, pairing data gathering and analytics with predictive intelligence: this helps you understand not just the present, but to predict scenarios and generate solutions for the future.
The document discusses decision trees and the ID3 algorithm. It provides an overview of data mining techniques, including decision trees. It then describes the ID3 algorithm in detail, including how it uses information gain to build decision trees top-down and recursively to classify data. An example of applying the ID3 algorithm to a sample dataset is also provided to illustrate the step-by-step process.
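The recursive, gain-driven construction described can be sketched as a compact ID3 implementation on a toy dataset (the weather-style records are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels: list) -> float:
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(rows: list, labels: list) -> str:
    """Pick the attribute whose split maximizes information gain (ID3's core step)."""
    def gain(f):
        groups = {}
        for row, lab in zip(rows, labels):
            groups.setdefault(row[f], []).append(lab)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder
    return max(rows[0], key=gain)

def id3(rows: list, labels: list):
    """Recursively build a tree of the form {feature: {value: subtree-or-leaf}}."""
    if len(set(labels)) == 1:       # pure node: stop and emit a leaf
        return labels[0]
    f = best_feature(rows, labels)
    tree = {}
    for value in {row[f] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[f] == value]
        tree[value] = id3([rows[i] for i in idx], [labels[i] for i in idx])
    return {f: tree}

rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain", "windy": "no"}, {"outlook": "rain", "windy": "yes"}]
labels = ["yes", "yes", "no", "no"]
tree = id3(rows, labels)
```

Here `outlook` alone separates the classes, so ID3 selects it at the root and both branches become pure leaves, exactly the top-down, greedy behavior the document describes.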
This document discusses using machine learning algorithms to predict employee attrition and understand factors that influence turnover. It evaluates different machine learning models on an employee turnover dataset to classify employees who are at risk of leaving. Logistic regression and random forest classifiers are applied and achieve accuracy rates of 78% and 98% respectively. The document also discusses preprocessing techniques and visualizing insights from the models to better understand employee turnover.
Automated Validation of Internet Security Protocols and Applications (AVISPA) - Krassen Deltchev
This is my first B.Sc. term paper, from 2006. Back in the day my English was bad, which is obvious when reading the paper, but I still love it, because this was my academic starting point on the topic of IT security. Enjoy!
This B.Sc. term paper is presented to the
Department of Electrical Engineering and Information Sciences
of the Ruhr-University of Bochum
Chair of Network and Data Security
of the Ruhr-University of Bochum,
Horst-Görtz Institute,
Prof. Jörg Schwenk
Abstract:
The AVISPA Model Checker is a tool for automated validation and verification of security
protocols. It provides a push-button web-based software- and hardware-independent interface and
installation binaries for UNIX-based Operating Systems.
It belongs to the group of the state-of-the-art Model Checkers and uses a modular and descriptive
formal language for specifying industrial-scale security protocols.
The different back-ends of the AVISPA tool implement new optimized analysing techniques for
automated protocol verification.
Therefore researchers and scientists can verify even larger protocol specifications in a short
time and in a user-friendly way.
New cryptographic attacks are explored using the AVISPA tool, and the Model Checker covers
a wide range of modern internet authentication protocols with regard to their security validation.
This document provides an overview of common terminology used in Dimensions market research software and surveys. It defines key terms like Dimensions, Professional, DDL, projects, respondents, categories, and variables. It also outlines common methodologies for survey administration including PAPI, CATI, CAWI, CAPI, and venue interviews. Finally, it provides guidance on reading code examples, noting that optional parameters will be in square brackets.
The document is a software requirements specification for a system to perform record matching over query results from multiple web databases. It describes the purpose, conventions, intended users, product scope, and references. It provides an overall description of the product perspective and functions, describes user classes and characteristics, operating environment, design constraints, and documentation. It outlines external interface requirements including user interfaces, hardware/software interfaces, and communications interfaces. It details system features and other non-functional requirements around performance, safety, security, quality, and business rules.
MICRE: Microservices In MediCal Research Environments - Martin Chapman
The document discusses integrating microservices into medical research software development. It describes benefits like resilience, scalability, and ease of deployment when dealing with heterogeneous technologies and end-users. Three needs are identified: 1) a model specifying technologies; 2) tools to help researchers segment software; and 3) training. Examples of microservice architectures for medical workflows, clinical guidelines, and a decision support system are provided, along with developer experiences and ideas to explore different technologies and practices.
This document discusses the need for protective apparatus in power systems. Power systems are designed to generate and distribute electric power continuously to meet demand. To ensure reliable service and maximize investment returns, the system must operate continuously without major breakdowns. This can be achieved by implementing protective devices that detect and quickly clear faults, minimizing disruption. Protective devices are needed to isolate faulty sections and maintain continuity of supply to unaffected sections. The basic requirements of protection systems and their components are also outlined.
The document discusses expert systems, which are computer systems that emulate the decision-making ability of a human expert. It describes the typical architecture of an expert system, which includes a knowledge base, inference engine, user interface, explanation facility, and knowledge acquisition system. It provides details on key components like the knowledge base, which stores rules and data, and the inference engine, which applies rules and reasoning to derive conclusions. Specific expert systems are discussed like MYCIN for medical diagnosis, DART for computer fault diagnosis, and XCON for configuring DEC computer systems. The roles of knowledge engineers and domain experts in developing expert systems are also outlined.
Open University, Data Mining Seminar 13802
Semester 2015b
Malware Detection via Data Mining
Prof Roy Gelbard
David Zivi 204785638
Contents
Terminology and Definitions
Introduction
Research Question
    Study goal
    Study importance
Mapping of Knowledge Elements
Bibliography Review
    Signature-based detection
    Heuristic-based detection
    Behavioral-based detection
    Sandbox detection
    Data mining techniques
Research methodology
    Raw Data Acquisition
    Extraction of Significant Data
    Opcode Relevance in Malware
    Average Calculation
    Results Export
    Weka
    Prediction parameters
    Noise method
Results
    Recurrent WEKA process
        First run
        Second run
        Third run
        Fourth run
        Fifth run
        Sixth run
        Seventh run
Results summary per round
Rules generated by WEKA
Noise on model
Result Discussion and Future Research
    Future Research
        Extension of the opcodes set
        Sensitive system call
        PE header analysis
Resources List
Terminology and Definitions
Virus: A computer virus is a type of malware that propagates by inserting a copy
of itself into, and becoming part of, another program. It spreads from one
computer to another, leaving infections as it travels. Viruses can range in severity
from causing mildly annoying effects to damaging data or software and causing
denial-of-service (DoS) conditions. Almost all viruses are attached to an
executable file, which means the virus may exist on a system, but will not be
active or able to spread until a user runs or opens the malicious host file or
program. When the host code is executed, the viral code is executed as well. [1]
Disassembler: a computer program that translates machine language into
assembly language—the inverse operation to that of an assembler. Disassembly,
the output of a disassembler, is often formatted for human-readability rather
than suitability for input to an assembler, making it principally a reverse-
engineering tool. [2]
Opcode: In computing, an opcode (abbreviated from operation code) is the
portion of a machine language instruction that specifies the operation to be
performed. Besides the opcode itself, instructions usually specify the data they
will process, in the form of operands. In addition to opcodes used in instruction set
architectures of various CPUs, which are hardware devices, opcodes can also be
used in abstract computing machines as part of their byte code specifications. [3]
x86 instruction set: x86 is a family of backward compatible instruction set
architectures based on the Intel 8086 CPU and its Intel 8088 variant. The 8086
was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit based 8080
microprocessor, with memory segmentation as a solution for addressing more
memory than can be covered by a plain 16-bit address. The term "x86" came into
being because the names of several successors to Intel's 8086 processor
ended in "86", including the 80186, 80286, 80386 and 80486 processors. [4]
WEKA: is a workbench that contains a collection of visualization tools and
algorithms for data analysis and predictive modeling, together with graphical user
interfaces for easy access to this functionality. All of Weka's techniques are
predicated on the assumption that the data is available as a single flat file or relation,
where each data point is described by a fixed number of attributes. [17]
Introduction
In this study I present a technique I developed to determine whether an application is
malware, using data mining. The underlying method is to teach the system how to
differentiate between software that is or is not malware, using a dataset represented
by a list of instructions that potentially characterize malware. The list of instructions
was collected from previous research done on this topic.
The technique has two steps: the first consists of disassembly and opcode frequency
calculation; the second applies the J48 learning algorithm provided by the WEKA
library.
For the dataset, I extracted the relevant data from 300 known malware samples [6] and
150 benign applications typically found on a home computer under "C:\Program Files".
This data was fed to WEKA, which generated rules to be used to determine whether a
piece of software is malware or not.
These rules were run on the dataset I compiled (using cross-validation) and were able
to predict, with an accuracy of 96%, whether a piece of software is malware.
In order to check the robustness of the rules, noise with an intensity ranging from 2%
to 50% was randomly added to the relevant data without significant regression in the
score mentioned above: even with noise of up to 50%, the prediction score decreased
only to 91%.
Research Question
Study goal:
Today, with the exponential growth of "freeware" software [7], users and
corporations can find a large variety of applications and utilities that can be installed
for free. Since those applications come from unknown sources, the question raised is:
can a user or corporation benefit from some free applications without compromising
their entire system?
The goal of this study is to build rules that determine whether an executable or
library received from a third party can be trusted. This study does not purport to
replace known anti-virus products, but to propose a complementary mechanism that
makes up for the weaknesses of known anti-virus programs.
Limitations of current methods:
The common technique used by anti-virus software to determine whether an executable
is malware is scanning. A scanner searches all files in memory and on disk for code
snippets that uniquely identify a file as malware. Such mechanisms have two main
weaknesses:
- Attackers interested in propagating a known malware can simply change the code
snippets that the anti-virus is looking for.
- New malware that has not yet been classified will be considered benign software
until it is analyzed and classified. In the meantime, the malware will continue to infect
the system and spread to new systems until the anti-virus is updated.
Study importance:
According to a newly-released report sponsored by McAfee, global cyber activity is
costing up to $500 billion each year, which is almost as much as the estimated cost of
drug trafficking [5]. In the third quarter of 2015, McAfee Labs detected more than 307
new threats every minute, or more than five every second, with mobile malware samples
growing by 16 percent during the quarter, and overall malware surging by 76 percent
year over year [8].
Malware is becoming more and more sophisticated and provides high revenues to its
owners. Malware editors have become well organized and structured, with impressive
skills and highly qualified resources. Due to the exponential growth of malware and its
agility in camouflaging itself, keeping a sterile system has become a tremendous task
for users and corporations. Since current anti-virus products run against known-malware
databases that are updated once a day at best, malware has an entire day to infect a
system before it is caught.
Mapping of Knowledge Elements
The table below maps each characteristic/process to its form in the human world and in
the machine world.

Data
    In Human World: Software editor; software behavior
    In Machine World: All the data is saved in an automatic way; every malware has its
    own signature; malware characteristics are saved in a database
Information
    In Human World: Collect information about known malware; collect information
    about the software to check
    In Machine World: Basic statistical calculation on raw data
Knowledge
    In Human World: Can get a global feeling about the software we want to check; a
    tendency can be deduced
    In Machine World: Run an algorithm on the software in order to determine whether
    we are dealing with malware or not
Data transformation to Information & Knowledge
    In Human World: With the data we have, we can deduce which software we have
    to check
    In Machine World: Installation validation based on a decision tree
Information & Knowledge transformation to Data
    In Human World: The knowledge can be transformed to data; for example, the
    software signature can be saved in a database
    In Machine World: Exists in learning systems, where the conclusions and knowledge
    are automatically translated to data
Transformation of tacit knowledge to explicit knowledge
    In Human World: A malware analyst writes down in a formal way the conclusions
    reached about a malware pattern
    In Machine World: Does not exist
Transformation of explicit knowledge to tacit knowledge
    In Human World: A malware analyst learns from explicit knowledge
    In Machine World: Does not exist
Knowledge contribution in decision
    In Human World: Based on explicit & tacit knowledge, the analyst decides to accept
    or reject software
    In Machine World: According to the decision tree
Knowledge contribution to innovation
    In Human World: Update of anti-virus engines based on knowledge
    In Machine World: Does not exist
Learning and knowledge sharing
    In Human World: Learning of new techniques used by malware
    In Machine World: The system automatically updates its research criteria
Bibliography Review
The exponential growth of malware encourages security researchers to invent new techniques to
protect computers and networks. The various techniques used for malware detection [9] are
described below:
Signature-based detection:
This technique is the most common method used to identify viruses and other malware. The
anti-virus engine compares the contents of a file to its database of known malware signatures.
Such a technique requires a daily update of the malware database.
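As a toy illustration of this mechanism, signature scanning amounts to a byte-pattern search over file contents. The signature name and byte pattern below are hypothetical, and real engines use optimized multi-pattern matching rather than a naive search:

```python
# Hypothetical signature database: name -> byte pattern (not a real signature)
SIGNATURES = {
    "Example.Trojan.A": bytes.fromhex("e800000000585dc3"),
}

def scan(file_bytes):
    """Return the names of all signatures found in the file contents."""
    return [name for name, sig in SIGNATURES.items() if sig in file_bytes]

# A file containing the pattern is flagged; a clean file is not.
sample = b"\x90\x90" + bytes.fromhex("e800000000585dc3") + b"\x90"
print(scan(sample))  # ['Example.Trojan.A']
```

This also makes the first weakness above concrete: flipping a single byte of the embedded pattern defeats the exact-match scan.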
Heuristic-based detection:
This technique is generally used together with signature-based detection. It detects malware
based on characteristics typically used in known malware code.
Behavioral-based detection:
This technique is similar to heuristic-based detection and is also used in Intrusion Detection
Systems. The main difference is that, instead of relying on characteristics hardcoded in the
malware code itself, it is based on the behavioral fingerprint of the malware at run time. Clearly,
this technique is able to detect (known or unknown) malware only after it has started performing
its malicious actions.
Sandbox detection:
This technique is a particular kind of behavioral-based detection that, instead of detecting the
behavioral fingerprint at run time, executes the program in a virtual environment, logging
whatever actions the program performs. Depending on the actions logged, the anti-virus engine
can determine whether the program is malicious or not. If not, the program is executed in the
real environment. Even though this technique has been shown to be quite effective, it is heavy
and slow, so it is rarely used in end-user anti-virus solutions.
Data mining techniques:
Data mining techniques are one of the latest approaches applied to malware detection. Data
mining and machine learning algorithms are used to classify the behavior of a file (as either
malicious or benign) given a series of features extracted from the file itself. This study
focused on the following techniques:
- Malware detection via analysis of number of strings, call and binary patterns [10]
- Malware detection via analysis of program executable header [11]
- Malware prediction via function call frequency, usage of non-standard
instructions and use of suspicious system calls [12]
- Malware detection via statistical analysis of opcode distributions [13]
Research methodology
The following picture describes the flow used in this study to generate rules that will be able to
catch malware:
Raw Data Acquisition
In order to learn from malware and benign software behavior, a large database of samples of
both malware and benign software is needed. Since malware has been analyzed and classified
by security researchers, it is quite easy to find malware databases on the internet [14]. In this
study, all the known and classified malware from the year 2014 is used. In order to prevent noise
in the dataset, derivatives of the same malware are not included in the sample. All 400 families
of malware found in 2014 were classified and used in this study. The benign software dataset
consists of standard applications located under "C:\Program Files (x86)", such as Outlook,
Word, Excel, and Calculator, taken from a non-infected computer.
Note: For security reasons, the malware database is protected by a password that can be
retrieved from [14]. All the research and access to malware for this study were done on
dedicated virtual machines in order to prevent unintentional infection.
Extraction of Significant Data
According to the aforementioned bibliography on malware detection via data mining, this study
focuses on the following opcodes: call, nop, int, rdtsc, sbb, shld, fdivp, imul, pushf, setb, fild and
xor. In this study these opcodes serve as the criteria for malware detection. All of the above
opcodes are extracted from executables and libraries using the IDA [15] disassembler.
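The counting part of the extraction step can be sketched as follows. This is a minimal illustration rather than the study's actual script: it assumes a textual disassembly listing (such as an IDA export) with one instruction per line, the mnemonic being the first token.

```python
from collections import Counter

# Opcodes tracked in this study
TRACKED = {"call", "nop", "int", "rdtsc", "sbb", "shld",
           "fdivp", "imul", "pushf", "setb", "fild", "xor"}

def count_opcodes(listing):
    """Count occurrences of the tracked mnemonics in a disassembly listing.

    Assumes each line holds one instruction whose first token is the
    mnemonic, e.g. "call    sub_401000" or "xor     eax, eax".
    """
    counts = Counter()
    for line in listing.splitlines():
        tokens = line.split()
        if tokens and tokens[0].lower() in TRACKED:
            counts[tokens[0].lower()] += 1
    return counts

sample = """
push    ebp
mov     ebp, esp
xor     eax, eax
call    sub_401000
imul    eax, ecx
xor     eax, eax
ret
"""
print(count_opcodes(sample))  # Counter({'xor': 2, 'call': 1, 'imul': 1})
```

Untracked mnemonics such as `push` and `mov` are simply ignored, so the output depends only on the chosen criterion set.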
Opcode Relevance in Malware
One of the main challenges posed by malware is its capability for "camouflage". In order to
survive anti-virus engines and security researchers' analysis, malware must hide itself, since
once it is discovered it is automatically removed from the infected computer. Moreover, when a
malware sample is discovered, its characteristics are shared with the entire community. In order
to hide themselves, malware authors use a technique called "packing" [16], which consists of
compressing or encrypting the original executable. When the compressed executable is
executed, the decompression/decryption code recreates the original code from the
compressed/encrypted code before executing it. Executable compression is also frequently used
to deter reverse engineering or to obfuscate the contents of the executable, for example to hide
the presence of malware from anti-virus scanners. Executable compression can be used to
prevent direct disassembly; it consists of masking string literals and modifying signatures.
Although this does not eliminate the chance of reverse engineering, it can make the process
more costly. The following picture illustrates how the "packing" mechanism works:
Average Calculation
A Python script is used to count the number of instances found for each of the relevant
opcodes listed above. Then a value is calculated for every relevant opcode according to the
following formula (example for the "call" opcode):
Call percentage = (number of call instructions × size of the call opcode) / size of the .text section
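The formula above can be written directly as a small function. The numbers in the example are illustrative: the 5-byte size assumes the common near-call (E8 rel32) encoding, whereas the real encoded size varies with the operands.

```python
def opcode_percentage(count, opcode_size, text_section_size):
    """(number of occurrences * size of the opcode) / size of the .text section"""
    return (count * opcode_size) / text_section_size

# Example: 120 "call" instructions at an assumed 5 bytes each
# (E8 rel32 encoding) in a 40960-byte .text section.
share = opcode_percentage(120, 5, 40960)
print(share)  # 0.0146484375
```

Normalizing by the .text section size makes the values comparable across executables of very different sizes.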
Results Export
The same Python script then exports an Excel table of all the opcode values, averaged for every
disassembled file (benign software & malware).
Weka
The generated Excel file is fed to the WEKA data mining tool for analysis. Since the target field is
nominal, the J48 and KStar algorithms are used. The test options used to validate the model are
cross-validation and a percentage split of 66%.
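A 66% percentage split can be sketched as below. This is a generic illustration with a seeded shuffle, not WEKA's internal implementation; the sample count of 450 matches the 300 malware plus 150 benign samples of this study:

```python
import random

def percentage_split(samples, train_fraction=0.66, seed=1):
    """Shuffle the samples, then split into train and test subsets."""
    rng = random.Random(seed)        # seeded for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(450))              # stand-in for the 450 sample rows
train, test = percentage_split(rows)
print(len(train), len(test))  # 297 153
```

The model is trained on the 66% portion and its prediction score is measured on the held-out 34%.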
Prediction parameters
TP (true positive): rate of malware correctly predicted as malware
TN (true negative): rate of benign software correctly predicted as benign
FP (false positive): benign software incorrectly predicted as malware
FN (false negative): malware incorrectly predicted as benign software (a missed detection)
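From these four counts the usual summary scores follow directly. The sketch below uses illustrative numbers, not the study's exact figures, with malware taken as the positive class:

```python
def metrics(tp, tn, fp, fn):
    """Standard classification metrics, with malware as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # true-positive rate
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative counts only
acc, prec, rec, f1 = metrics(tp=290, tn=140, fp=10, fn=10)
print(round(acc, 3), round(f1, 3))  # 0.956 0.967
```

The F-score reported in the noise experiment below combines precision and recall in exactly this way.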
Noise method
A randomized noise with intensity ranging from 2% to 50% is applied to the generated Excel
table. The noise is applied only to the head of the tree, i.e., in our case, to the "imul" attribute.
Applying such noise determines the robustness of the model; in other words, whether standard
noise contests the study's conclusion or not. Since the noise source can be the result of
non-standard code, for example code written directly in assembler by programmers, we assume
that typical noise can have an intensity of 30%. If the malware prediction score is still greater
than 90%, we can conclude that the generated model is robust enough.
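The noise injection can be sketched as below. This is one plausible interpretation, assumed for illustration since the study does not specify the exact noise distribution: each value of the chosen attribute is perturbed by a uniform random factor within ±intensity.

```python
import random

def add_noise(values, intensity, seed=1):
    """Perturb each value by a random factor in [-intensity, +intensity].

    intensity is a fraction, e.g. 0.30 for 30% noise.
    """
    rng = random.Random(seed)
    return [v * (1 + rng.uniform(-intensity, intensity)) for v in values]

imul_freq = [0.012, 0.034, 0.008, 0.021]   # example "imul" frequencies
noisy = add_noise(imul_freq, intensity=0.30)
# Every noisy value stays within 30% of the original
assert all(abs(n - v) <= 0.30 * v + 1e-12 for n, v in zip(noisy, imul_freq))
```

The model is then re-evaluated on the perturbed column at each intensity level to produce the F-score table below.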
Results
In this research a deterministic method for malware prediction was formulated and tested. The
method proved to be highly successful in differentiating between malware and benign software.
A key factor in efficiently identifying malware was having an appropriate set of instructions. To
find the ideal instruction set, I performed the investigation in a recurrent manner, as described
below:
Recurrent WEKA process
As opposed to the previously mentioned research on malware detection via data mining, such as
Sanjam Singla et al. [12], in this study the analysis did not stop when a high level of predictability
was achieved. In every WEKA iteration the "head of the tree" (the root attribute of the decision
tree) was removed and a new analysis was performed to determine whether that opcode is
essential to the prediction. Getting a good result after removing the head of the tree means that
the removed opcode is not indispensable for prediction. The following describes the recurrent
WEKA process used:
First run
In the first WEKA run, all the potential opcodes are taken into account, i.e. call, nop, int, rdtsc,
sbb, shld, fdivp, imul, pushf, setb, fild and xor. The prediction score was 98%, with the "call"
opcode at the head of the tree.
Second run
The call opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul,
pushf, setb, fild and xor. The prediction score was 97%, with the "xor" opcode at the head of
the tree.
Third run
The xor opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul,
pushf, setb and fild. The prediction score was 97%, with the "int" opcode at the head of the tree.
Fourth run
The int opcode was removed, so WEKA was run with: nop, rdtsc, sbb, shld, fdivp, imul, pushf,
setb and fild. The prediction score was 96%, with the "rdtsc" opcode at the head of the tree.
Fifth run
The rdtsc opcode was removed, so WEKA was run with: nop, sbb, shld, fdivp, imul, pushf, setb
and fild. The prediction score was 95%, with the "sbb" opcode at the head of the tree.
Sixth run
The sbb opcode was removed, so WEKA was run with: nop, shld, fdivp, imul, pushf, setb and fild.
The prediction score was 91%, with the "imul" opcode at the head of the tree.
Seventh run
The imul opcode was removed, so WEKA was run with: nop, shld, fdivp, pushf, setb and fild. The
prediction score decreased to 78%. At this point the recurrent processing was stopped due to the
significant decrease in prediction score. The last acceptable score was 91%, with the opcodes:
nop, shld, fdivp, imul, pushf, setb and fild.
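The recurrent process above can be sketched as a loop that repeatedly finds the attribute giving the best split (the "head of the tree" in a decision tree such as J48), records it, and removes it before the next round. The sketch below is a deliberate simplification under stated assumptions: a single fixed-threshold information-gain split on a tiny made-up dataset, not an actual J48 run.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr, threshold):
    """Information gain of splitting the rows on rows[attr] > threshold."""
    above = [l for r, l in zip(rows, labels) if r[attr] > threshold]
    below = [l for r, l in zip(rows, labels) if r[attr] <= threshold]
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in (above, below) if p)
    return entropy(labels) - remainder

def removal_order(rows, labels, attrs, threshold=0.01):
    """Repeatedly pick the best-splitting attribute, record it, remove it."""
    remaining, order = list(attrs), []
    while remaining:
        head = max(remaining,
                   key=lambda a: info_gain(rows, labels, a, threshold))
        order.append(head)
        remaining.remove(head)
    return order

# Toy opcode-frequency rows: "imul" separates the classes perfectly here,
# so it is selected as the head of the tree in the first round.
rows = [{"imul": 0.030, "nop": 0.02}, {"imul": 0.040, "nop": 0.01},
        {"imul": 0.001, "nop": 0.02}, {"imul": 0.002, "nop": 0.03}]
labels = ["virus", "virus", "clean", "clean"]
print(removal_order(rows, labels, ["imul", "nop"]))  # ['imul', 'nop']
```

In the study itself each round is a full WEKA run with the prediction score re-measured, and the loop stops once the score drops sharply, as in the seventh run above.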
Rules generated by WEKA
The following graph represents the rules generated by WEKA for malware prediction, with a
score of 96%.
The following table represents the confusion matrix for the J48 algorithm:
                 Predicted clean   Predicted virus
Actual clean            102                 4
Actual virus              4               307
Noise on model
The following table summarizes the noise intensity applied at the head of the tree, together with
the corresponding F-score.
The following gives a graphical representation of the above results.
Noise intensity (%) on "imul" variable    F-score
 0                                        0.969
 2                                        0.966
 5                                        0.964
 8                                        0.966
10                                        0.966
13                                        0.964
16                                        0.966
20                                        0.954
25                                        0.961
30                                        0.961
35                                        0.946
40                                        0.954
45                                        0.939
50                                        0.916
Data set: call, xor, int, rdtsc & sbb removed
Result Discussion and Future Research
The goal of this study was to demonstrate that malware can be caught via analysis of executable
opcodes. I have shown that malware detection via data mining is a very promising method, since
the prediction score achieved is 96% for 2014 malware. This study found a different set of
instructions that point to code being malware, compared to previous research done on this
topic. The final instructions found in our generated tree are: "imul", "pushf" and "fild". As
explained before, these instructions are commonly used by "packer" and "protector" software in
order to unpack/decrypt the malware code.
The novel aspects of this study, as opposed to previous research done in this field, are:
- Analysis of the malware surface: since most malware uses packers and protectors to hide
from security researchers, only a small portion of the malware code can be analyzed after
disassembly. This study showed that analysis of that small portion of code (the loader) is
enough to detect whether an executable is malware.
- Recurrent runs of WEKA: as opposed to previous research, the WEKA data mining tool was
run many times in order to find the instructions that really influence the prediction. Originally
the "call" instruction was singled out as the instruction identifying software as malware or not,
as in the research by Sanjam Singla et al. [12], but after recurrent runs of WEKA the "imul"
instruction was found to be the identifier of malware.
Future Research
Extension of the opcodes set
Since malware changes very frequently, in future research the instruction set used for prediction
must be enlarged and updated according to new malware. Furthermore, malware commonly
uses rare instructions that will never be generated by a compiler, so having such rare
instructions in our instruction dataset will help to recognize it.
Sensitive system call
In this study, only malware instructions were checked. As claimed in the bibliography above,
malware commonly uses sensitive calls such as "VirtualAllocEx" and "IsDebuggerPresent". Use
of such calls can point to malware as well.
PE header analysis
Another approach to detecting malware is to check the program header format of the
executable. Since the packing/protection mechanism modifies the "entry point" of the program,
we can find PE header malformations that point to malware as well.
Resources List
[1] http://www.cisco.com/web/about/security/intelligence/virus-worm-diffs.html
[2] https://en.wikipedia.org/wiki/Disassembler
[3] https://en.wikipedia.org/wiki/Opcode
[4] http://www.felixcloutier.com/x86/
[5] http://www.foxbusiness.com/technology/2013/07/22/report-cyber-crime-costs-global-economy-up-to-1-trillion-year/
[6] http://www.nothink.org/honeypots/malware-archives/
[7] https://en.wikipedia.org/wiki/Freeware
[8] http://www.mcafee.com/us/about/news/2014/q4/20141209-01.aspx
[9] https://en.wikipedia.org/wiki/Antivirus_software
[10] M. Schultz, E. Eskin, E. Zadok, 2001. Data Mining Methods for Detection of New Malicious Executables.
[11] Usukhbayar Baldangombo et al. A Static Malware Detection System Using Data Mining.
[12] Sanjam Singla et al., 2015. A Novel Approach to Malware Detection Using Static Classification.
[13] Daniel Bilar, 2007. Opcodes as Predictor for Malware. International Journal of Electronic Security and Digital Forensics.
[14] http://www.nothink.org/honeypots/malware-archives/
[15] https://www.hex-rays.com/products/ida/index.shtml
[16] https://en.wikipedia.org/wiki/Executable_compression
[17] https://en.wikipedia.org/wiki/Weka_%28machine_learning%29