This document summarizes a presentation on using decision trees and other machine learning techniques for anomaly detection on the NSL KDD Cup 99 dataset. It discusses anomaly detection, machine learning, and algorithms such as decision trees, SVM, and Naive Bayes, along with their application to intrusion detection. It then describes an experiment that used the decision tree algorithm on the NSL KDD Cup 99 dataset to classify network traffic as normal or anomalous. The decision tree model achieved over 98% accuracy on both the full dataset and a reduced feature set.
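The core step of a decision tree classifier, choosing the threshold that best separates the classes, can be sketched as follows. This is a minimal illustration and not the presenters' experiment: the single feature (a hypothetical per-host connection count), the toy values, and the helper names `gini`, `best_split`, and `predict` are all assumptions made for this sketch, and Gini impurity is only one of several possible split criteria.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: connection counts and their labels (not NSL-KDD records).
counts = [2, 3, 4, 5, 40, 55, 60, 80]
labels = ["normal"] * 4 + ["anomaly"] * 4

threshold, impurity = best_split(counts, labels)


def predict(count):
    """Classify one record with the single learned split (a one-level tree)."""
    return "normal" if count <= threshold else "anomaly"


print(threshold, impurity, predict(3), predict(70))
```

A full decision tree applies this split search recursively to each resulting subset and across all features; the sketch stops at one level to keep the idea visible.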
Using Machine Learning in Networks Intrusion Detection Systems, by Omar Shaya
The internet and computing devices ranging from desktop computers to smartphones have raised many security and privacy concerns, and a need has emerged to automate the detection of attacks so that these networks can be protected at scale. While traditional intrusion detection methods may detect previously known attacks, they struggle with new, unknown attacks, which makes machine learning a strong candidate for solving these challenges.
In this report, we investigate the use of machine learning for intrusion detection, that is, for detecting network attacks, by reviewing prior work in this field. In particular, we look at the work of Pasocal et al.
Seminar Presentation | Network Intrusion Detection using Supervised Machine L..., by Jowin John Chemban
By:
Jowin John Chemban (jowinchemban@gmail.com)
HGW16CS022 (2016-2020 Batch)
S7 B.Tech Computer Science Engineering
Holy Grace Academy of Engineering, Mala
Date : September 2019
With the growth of computer networking, electronic commerce and web services, network security systems have become very important for protecting information and networks against malicious usage or attacks. This report presents the design of an Intrusion Detection System that uses two artificial neural networks: one for Intrusion Detection and another for Attack Classification.
Computer Security and Intrusion Detection (IDS/IPS), by LJ PROJECTS
This presentation explains various types of possible attacks, security properties, traffic analysis, security mechanisms, intrusion detection systems, vulnerabilities, attack frameworks, etc.
Anomaly detection (or outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and process monitoring in various domains including energy, healthcare and finance. In this talk, we will introduce anomaly detection and discuss the various analytical and machine learning techniques used in this field. Through a case study, we will discuss how anomaly detection techniques could be applied to energy data sets. We will also demonstrate, using R and Apache Spark, an application to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
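The definition above, observations that do not conform to an expected pattern, can be illustrated with a simple z-score check. This is a hedged sketch in Python rather than the R and Apache Spark application mentioned in the talk; the readings, the threshold of 2.5, and the function name `zscore_anomalies` are assumptions chosen for illustration.

```python
import statistics


def zscore_anomalies(values, threshold=2.5):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / spread > threshold]


# Hypothetical energy readings with one obvious spike at the end.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 25.0]
anomalies = zscore_anomalies(readings)
print(anomalies)
```

The z-score assumes roughly bell-shaped data, and a single extreme value inflates the standard deviation it is measured against, so robust alternatives such as the median absolute deviation are often preferred in practice.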
Poisoning attacks on Federated Learning based IoT Intrusion Detection System, by Sai Kiran Kadam
Attacks on federated learning models are discussed as part of my research to build a model that overcomes the diverse security issues and vulnerabilities that arise in the cloud when building a unified machine learning model that lets multiple users and companies work together.
Part 1
- Introduction
- Application for Anomaly Detection
- AIOps
- GraphDB
Part 2
- Type Of Anomaly Detection
- How to Identify Outliers in your Data
Part 3
- Anomaly Detection for Timeseries Technique
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method proposed by Thomas Cover used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
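A minimal sketch of k-NN classification follows, assuming Euclidean distance and a simple majority vote among the k closest training examples. The two-dimensional toy points, the labels, and the function name `knn_predict` are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set.
train = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.9, 1.1), "normal"),
    ((5.0, 5.2), "attack"),
    ((5.1, 4.9), "attack"),
    ((4.8, 5.0), "attack"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 5.0)))
```

For regression, the same neighbor search applies, but the prediction would be an average of the neighbors' values instead of a vote.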
Seminar Report | Network Intrusion Detection using Supervised Machine Learnin..., by Jowin John Chemban
Seminar Report : Network Intrusion Detection using Supervised Machine Learning Technique with Feature Selection
By:
Jowin John Chemban (jowinchemban@gmail.com)
HGW16CS022 (2016-2020 Batch)
S7 B.Tech Computer Science Engineering
Holy Grace Academy of Engineering, Mala
Date : November 2019
Malware Detection Using Machine Learning, by Shubham Dubey
Malware detection is an important factor in the security of the computer systems. However, currently utilized signature-based methods cannot provide accurate detection of zero-day attacks and polymorphic viruses. That is why the need for machine learning-based detection arises.
Anomaly detection is a topic with many different applications. From social media tracking to cybersecurity, anomaly detection (or outlier detection) algorithms can have a huge impact on your organisation.
For the video please visit: https://www.youtube.com/watch?v=XEM2bYYxkTU
This slideshare has been produced by the Tesseract Academy (http://tesseract.academy), a company that educates decision makers in deep technical topics such as data science, analytics, machine learning and blockchain.
If you are interested in data science and related topics, make sure to also visit The Data Scientist: http://thedatascientist.com.
A review of machine learning based anomaly detection, by Mohamed Elfadly
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. These nonconforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different application domains.
What is IDS?
- Software or hardware device
- Monitors network or hosts for:
  - Malware (viruses, trojans, worms)
  - Network attacks via vulnerable ports
  - Host based attacks, e.g. privilege escalation
What is in an IDS?
An IDS normally consists of:
- Various sensors based within the network or on hosts
  - These are responsible for generating the security events
- A central engine
  - This correlates the events and uses heuristic techniques and rules to create alerts
- A console
  - To enable an administrator to monitor the alerts and configure/tune the sensors
Different types of IDS
- Network IDS (NIDS)
  - Examines all network traffic that passes the NIC that the sensor is running on
- Host based IDS (HIDS)
  - An agent on the host that monitors host activities and log files
- Stack-Based IDS
  - An agent on the host that monitors all of the packets that leave or enter the host
  - Can monitor a specific protocol(s) (e.g. HTTP for webserver)
Detecting Hacks: Anomaly Detection on Networking Data, by James Sirota
See https://medium.com/@jamessirota for a series of blog entries that goes with this deck...
Defense in Depth for Big Data
Network Anomaly Detection Overview
Volume Anomaly Detection
Feature Anomaly Detection
Model Architecture
Deployment on OpenSOC Platform
Questions
A Practical Guide to Anomaly Detection for DevOps, by BigPanda
Recent years have seen an explosion in the volumes of data that modern production environments generate. Making fast educated decisions about production incidents is more challenging than ever. BigPanda's team is passionate about solutions such as anomaly detection that tackle this very challenge.
This presentation covers topics such as: What is anomaly detection? What are the different types of data that may be used? What popular techniques may be used to identify anomalies? What are the best practices in anomaly detection? What is the value of anomaly detection?
Near duplicate detection algorithms have been proposed and implemented in order to detect and eliminate duplicate entries from massive datasets. Due to differences in data representation (such as measurement units) across data sources, potential duplicates may not be textually identical, even though they refer to the same real-world entity. As data warehouses typically contain data coming from several heterogeneous data sources, detecting near duplicates in a data warehouse requires considerable memory and processing power.
Traditionally, near duplicate detection algorithms are sequential and operate on a single computer. While parallel and distributed frameworks have recently been exploited in scaling the existing algorithms to operate over larger datasets, they are often focused on distributing a few chosen algorithms using frameworks such as MapReduce. A common distribution strategy and framework to parallelize the execution of the existing similarity join algorithms is still lacking.
In-Memory Data Grids (IMDG) offer distributed storage and execution, giving the illusion of a single large computer over multiple computing nodes in a cluster. This paper presents the research, design, and implementation of ∂u∂u, a distributed near duplicate detection framework, with preliminary evaluations measuring its performance and achieved speedup. ∂u∂u leverages the distributed shared memory and execution model provided by IMDG to execute existing near duplicate detection algorithms in a parallel and multi-tenanted environment. As a unified near duplicate detection framework for big data, ∂u∂u efficiently distributes the algorithms over utility computers in research labs and private clouds and grids.
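To make the notion of a near duplicate concrete, here is a minimal single-machine sketch using Jaccard similarity over character trigrams. It is not the ∂u∂u framework and involves no IMDG or distribution; the sample records, the 0.6 threshold, and the function names `shingles`, `jaccard`, and `near_duplicates` are assumptions for illustration.

```python
def shingles(text, k=3):
    """Set of character k-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(records, threshold=0.6):
    """Return index pairs (i, j), i < j, whose shingle sets meet the threshold."""
    sets = [shingles(r) for r in records]
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical records: the first two refer to the same entity.
records = [
    "Acme Corporation, 12 Main Street",
    "ACME Corporation, 12 Main St.",
    "Globex Inc, 99 Ocean Drive",
]
print(near_duplicates(records))
```

The pairwise loop is quadratic in the number of records, which is the cost that parallel and distributed approaches such as the one described above aim to spread across nodes.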
Adaptive Intrusion Detection Using Learning Classifiers, by Patrick Nicolas
This is an introduction to adaptive intrusion detection systems using rules-based learning classifiers. After listing the limitations of current clustering and supervised learning techniques, the presentation describes a new class of learning algorithms used for detecting and preventing intrusion in computer networks and data centers. Security policies are constantly upgraded or downgraded to adjust to an ever-changing IT environment, organization, and regulations by combining a genetic algorithm and reinforcement learning.
Data centers offer computational resources with various levels of guaranteed performance to the tenants, through differentiated Service Level Agreements (SLA). Typically, data center and cloud providers do not extend these guarantees to the networking layer. Since communication is carried over a network shared by all the tenants, the performance that a tenant application can achieve is unpredictable and depends on factors often beyond the tenant’s control.
We propose ViTeNA, a Software-Defined Networking-based virtual network embedding algorithm and approach that aims to solve these problems by using the abstraction of virtual networks. Virtual Tenant Networks (VTN) are isolated from each other, offering virtual networks to each of the tenants, with bandwidth guarantees. Deployed along with a scalable OpenFlow controller, ViTeNA allocates virtual tenant networks in a work-conservative system. Preliminary evaluations on data centers with tree and fat-tree topologies indicate that ViTeNA achieves both high consolidation on the allocation of virtual networks and high data center resource utilization.
machine learning in the age of big data: new approaches and business applicat..., by Armando Vieira
Presentation at University of Lisbon on Machine Learning and big data.
Deep learning algorithms and applications to credit risk analysis, churn detection and recommendation algorithms
Describes a link between KM technologies and business strategy through context-specific KM initiatives. Paper presented at CATI 2005, Congresso Anual de Tecnologia de Informação, São Paulo, Brazil.
As the complexity of choosing optimised and task specific steps and ML models is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area that targets progressive automation of machine learning AutoML.
Although it focuses on end users without expert knowledge, AutoML also offers new tools to machine learning experts, for example to:
1. Perform architecture search over deep representations
2. Analyse the importance of hyperparameters.
Analysis and Design for Intrusion Detection System Based on Data Mining, by Pritesh Ranjan
Reference:
Dyuanyang Zhao, Zhilin Feng, Qingxiang Xu, “Analysis and design for Intrusion detection system based on data mining” in proceedings of 2010 IEEE second international workshop on education technology and computer science
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference, by DB Tsai
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. In Lambda architecture, the system involves three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries, and each comes with its own set of requirements.
The batch layer aims at perfect accuracy by processing the entire available dataset, an immutable, append-only set of raw data, using a distributed processing system. The output is typically stored in a read-only database, with each result completely replacing the existing precomputed views. Apache Hadoop, Pig, and Hive are the de facto batch-processing systems.
In the speed layer, data is processed in a streaming fashion, and real-time views are built from the most recent data. The speed layer is thus responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. Its views may not be as accurate as the batch layer's views, which are created from the full dataset, so they are eventually replaced by the batch layer's views. Traditionally, Apache Storm is used in this layer.
In the serving layer, the results from the batch layer and the speed layer are stored, and queries are answered in a low-latency, ad-hoc way.
One example of the lambda architecture in a machine learning context is building a fraud detection system. In the speed layer, the incoming streaming data can be used for online learning to update the model learnt in the batch layer so that it incorporates recent events. After a while, the model can be rebuilt using the full dataset.
Why Spark for lambda architecture? Traditionally, different technologies are used in the batch layer and the speed layer. If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in SQL and in Java/Scala. This very quickly becomes a maintenance nightmare. With Spark, we have a unified development framework for the batch and speed layers at scale. In this talk, an end-to-end example implemented in Spark will be shown, and we will discuss the development, testing, maintenance, and deployment of a lambda architecture system with Apache Spark.
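The interplay of the three layers can be sketched in a few lines of plain Python. This is a toy, in-process illustration under stated assumptions and not the Spark example from the talk: the class `LambdaCounter`, its method names, and the per-key event counting are hypothetical stand-ins for real batch, speed, and serving systems.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture that counts events per key."""

    def __init__(self):
        self.master = []             # append-only raw events (batch layer input)
        self.batch_view = Counter()  # precomputed view, rebuilt from scratch
        self.speed_view = Counter()  # incremental view of events since the last batch run

    def ingest(self, key):
        """A new event reaches both the master dataset and the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view over the full dataset, replacing the old view."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()      # the real-time view is now superseded

    def query(self, key):
        """Serving layer: merge the batch view with the speed view."""
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for event in ["card_A", "card_A", "card_B"]:
    counter.ingest(event)
before_batch = counter.query("card_A")  # answered from the speed view only

counter.run_batch()
counter.ingest("card_A")
after_batch = counter.query("card_A")   # batch view plus one recent event
print(before_batch, after_batch)
```

In a real deployment the batch recomputation runs for a long time while events keep arriving, so clearing the speed view needs more care than this single-threaded sketch suggests.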
This slide presentation was delivered by Imron Zuhri at the Seminar & Workshop on the Introduction to and Potential of Big Data & Machine Learning, organized by KUDO on 14 May 2016.
A chi-square-SVM based pedagogical rule extraction method for microarray data..., by IJAAS Team
Support Vector Machine (SVM) is currently an efficient classification technique due to its ability to capture nonlinearities in diagnostic systems, but it does not reveal the knowledge learnt during training. In application areas of machine learning such as bioinformatics, it is important to understand how a decision is reached. A decision tree, on the other hand, has good comprehensibility; the process of converting an incomprehensible model into an understandable one is often referred to as rule extraction. In this paper we propose an approach for extracting rules from an SVM for microarray datasets by combining the merits of both the SVM and the decision tree. The proposed approach consists of three steps. First, SVM-CHI-SQUARE is employed to reduce the feature set. Second, the dataset with reduced features is used to obtain an SVM model, and synthetic data is generated. Finally, Classification and Regression Tree (CART) is used to generate rules. We use a breast masses dataset from the UCI repository, where comprehensibility is a key requirement. The experimental results show that, with the reduced feature dataset, the proposed approach extracts shorter rules, thereby improving the comprehensibility of the system. We obtained an accuracy of 93.53%, a sensitivity of 89.58%, a specificity of 96.70%, and a training time of 3.195 seconds. A comparative analysis with other algorithms is also carried out.
Outlier Detection in Data Mining An Essential Component of Semiconductor Manu..., by yieldWerx Semiconductor
Outlier detection is a critical research field within data mining due to its vast range of applications including fraud detection, cybersecurity, health diagnostics, and significantly for the semiconductor manufacturing industry. It refers to identifying data points that significantly deviate from expected patterns, providing crucial insights into different aspects of data. However, the ambiguity between outliers and normal behavior, evolving definitions of 'normal', application-specific techniques, and noisy data mimicking outliers, often complicate the outlier detection process. This review article offers an in-depth analysis of the most advanced outlier detection methods, presenting a thorough understanding of future research prospects.
Identifying and classifying unknown Network Disruption, by jagan477830
With the evolution of modern technology and the drastic increase in the scale of network communication, more and more network disruptions in traffic and private protocols have been taking place. Identifying and classifying unknown network disruptions can provide support and even help to maintain backup systems.
The International Journal of Engineering and Science (The IJES), by theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
New Fuzzy Logic Based Intrusion Detection System, by ijsrd.com
In this paper, we present an efficient intrusion detection technique. Intrusion detection plays an important role in network security. However, many current intrusion detection systems (IDSs) are signature based. A signature based IDS, also known as misuse detection, looks for a specific signature to match, which signals an intrusion. Provided with the signatures or patterns, such systems can detect many or all known attack patterns, but they are of little use for as yet unknown attacks. Their false positive rate is close to nil, but they are poor at detecting new attacks, variations of known attacks, or attacks that can be masked as normal behavior. Our proposed solution overcomes most of the limitations of the existing methods. The field of intrusion detection has received increasing attention in recent years. One reason is the explosive growth of the internet and the large number of networked systems that exist in all types of organizations. Intrusion detection techniques using data mining have attracted more and more interest in recent years. As an important application area of data mining, they aim to ameliorate the great burden of analyzing huge volumes of audit data and to optimize the performance of detection rules. The objective of this dissertation is to apply intrusion detection to a large dataset using a classification algorithm, the binary-class support vector machine, and to improve its learning time and detection rate in the field of network based IDS.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Law firms and lawyers can get rid of the manual review of text documents, correspondence, etc. Text analytics of unstructured documents surfaces potential knowledge that brings relevance and helps win cases. Moreover, text analytics offers small firms the same advantage that big firms have. As the information can be used to strengthen solutions and provide advice to attorneys, courtrooms will also benefit from more informed, better prepared legal teams and swift action, keeping long years of litigation away!
Developing an Artificial Immune Model for Cash Fraud Detection, by khawla Osama
Document from a thesis done by BSc students as graduation research, to develop a model that detects cash card fraud based on the cash card holder's pattern. The technique used to detect fraud is inspired by the immune system.
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer, by IJERA Editor
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has their own definition of what is tough and what is easy, and there is no absolute scale for measuring knowledge, but examination scores indicate a student's performance. In this case study, knowledge of data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data. It allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve academic performance in educational institutions. In this project, a live experiment was conducted on students: an exam was given to computer science majors using MOODLE (LMS), the generated data was analysed using RapidMiner (data mining software), and clustering was then performed on the data. This method helps to identify the students who need special advising or counselling by the teacher in order to provide a high quality of education.
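The clustering idea described above can be sketched with a small one-dimensional k-means written in plain Python. This is not the RapidMiner workflow from the study: the exam scores, the fixed initial centroids, and the function name `kmeans_1d` are assumptions chosen so that the example is deterministic.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on scalar values, starting from the given centroids.

    Returns (centroids, clusters), where clusters[i] holds the values
    assigned to centroids[i] in the final assignment step.
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # an empty cluster keeps its centroid
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical exam scores; the initial centroids are fixed for reproducibility.
scores = [35, 40, 42, 78, 82, 85, 90]
centroids, clusters = kmeans_1d(scores, [30.0, 90.0])
print(centroids, clusters)
```

Here the lower cluster would correspond to students who might need additional advising; with real data, the choice of k and of the initial centroids affects the grouping and deserves more care.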
Analysis on different Data mining Techniques and algorithms used in IOT, by IJERA Editor
In this paper, we discuss five functionalities of data mining in IoT that affect performance: data anomaly detection, data clustering, data classification, feature selection, and time series prediction. For each functionality, some important algorithms are reviewed, showing their advantages and limitations, along with some new algorithms that are still under research. We thereby present a knowledge view of data mining in IoT.
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
1. NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
An Experiment and Evaluation using Decision Tree
Under Guidance of Dr. Kalpana Thakre
NATIONAL CONFERENCE ON RECENT TRENDS AND ADVANCES IN COMPUTING, COMMUNICATION AND SECURITY
Presented by Sujeet Raosaheb Suryawanshi
ME IT SEM III ; Roll No. 613012
2. Agenda
Anomaly Detection
Machine Learning
IDPS
Survey of algorithms
Decision Tree
Experiment with NSL KDD Cup 99
Result
Future research roadmap
3. Anomaly Detection
Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) are used to protect trusted networks from untrusted networks.
One of the threats is the Denial of Service (DoS) attack.
Approaches to detect a DoS attack:
1. Signature based
2. Anomaly based
Signature-based detection deals with a limited, fixed set of known threats.
Anomaly-based detection centres on the concept of a baseline for network behaviour; any deviation from this baseline is considered an anomaly.
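The baseline idea can be illustrated with a minimal Python sketch. It assumes the baseline is just the mean and standard deviation of one traffic measure, and that "deviation" means more than `k` standard deviations from the mean. The function names, the threshold `k=3.0` and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def fit_baseline(samples):
    """Summarise normal behaviour as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomaly(value, baseline, k=3.0):
    """Flag a value more than k standard deviations away from the baseline mean."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma

# Hypothetical requests-per-second readings taken during normal operation.
normal_traffic = [98, 102, 101, 97, 100, 103, 99, 100]
baseline = fit_baseline(normal_traffic)

print(is_anomaly(101, baseline))  # close to the baseline
print(is_anomaly(950, baseline))  # far from the baseline, e.g. a traffic flood
```

A real system would track many features at once and update the baseline over time; this sketch only shows the deviation test itself.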
4. Machine Learning
A scientific discipline concerned with the design and development of algorithms that allow computers to learn from data. A major focus of machine learning research is to automatically learn to recognize complex patterns.
This is similar to the way the human brain works: humans make decisions based on what they have learned or experienced.
5. Motivation & Objective
To understand the techniques available to support the vision envisaged for “Anomaly Detection using Machine Learning Technique”
To experiment with and evaluate the NSL KDD Cup 99 dataset using a Decision Tree classifier
To understand various anomaly detection and machine learning techniques
To identify requirements for building a platform for an anomaly detection system
6. Classification of IDPS
Intrusion Detection System
  Data collection techniques
    HIDS
    NIDS
  Data analysis techniques
    Specification based
    Anomaly based
      Nearest neighbor based
      Clustering based
        K-Means
      Statistical based
      Classification based
        SVM
        Fuzzy Logic
        Genetic Algo
        Decision Tree
        Naive Bayesian
        Neural Network
        Others
    Signature based
7. Techniques
Nearest neighbor based detection techniques
  Assumption, normal data instances: present in dense neighbourhoods
  Assumption, anomalies: occur far from their closest neighbours
  Advantages: unsupervised/semi-supervised mode; simplest approach
  Disadvantages: high computational cost in testing phase; difficult where several regions have widely differing densities; difficult to identify anomalies that are present in groups; dependent on the proximity measures used
Clustering-based anomaly detection techniques
  Assumption, normal data instances: belong to a cluster in the data, lie close to their closest cluster centroid, belong to large and dense clusters
  Assumption, anomalies: do not belong to any cluster, are far away from their closest cluster centroid, belong to clusters that are either too small or too sparse
  Advantages: unsupervised; fast comparison
  Disadvantages: high computation cost in cluster formation phase; a data object not belonging to any cluster may be noise rather than an anomaly; not suited for large datasets; fail to label anomalies in certain cases
Statistical techniques
  Assumption, normal data instances: occur in high probability regions of a stochastic model
  Assumption, anomalies: occur in the low probability regions of the stochastic model
  Advantages: unsupervised and simple; confidence interval is provided with anomaly score
  Disadvantages: fail to label the anomalies correctly in certain cases; difficult to find the best statistic; for multivariate data, fail to capture the interactions between different attributes
Classification techniques
  Assumption: a classifier that can distinguish between normal and anomalous classes can be learnt in the given feature space
  Advantages: fast testing phase; improved efficiency with ensemble methods
  Disadvantages: heavy dependency and reliance on training data; class imbalance problem
8. Decision Tree, SVM, Naive Bayes, ANN, Fuzzy Logic, GA, K-Means
Decision Tree
  Technique: Classification
  Computation cost: High
  High dimensional data: Yes
  Advantages: easy to understand for smaller trees; handles irrelevant and missing data; compact after pruning
  Disadvantages: fails to classify scattered data; uses a greedy algorithm, hence may not find the best tree
SVM
  Technique: Classification & Regression
  Computation cost: High
  High dimensional data: Yes
  Advantages: high detection accuracy; learning ability for a small set of samples; high training rate and decision rate, insensitiveness to dimension of input data
  Disadvantages: positive & negative examples required; high dependency on selecting a good kernel function; training takes a long time
Naive Bayes
  Technique: Classification
  Computation cost: Less
  High dimensional data: Yes
  Advantages: easy construction; takes short computation time; works efficiently with large datasets
  Disadvantages: difficult to handle continuous features; highly dependent on prior knowledge
ANN
  Technique: Classification
  Computation cost: -
  High dimensional data: Yes
  Advantages: ability to generalize from limited, noisy and incomplete data; ease of use; detects unknown intrusions; supports multiclass detection
  Disadvantages: training required; needs to be emulated; longer training process; over-fitting issue
Fuzzy Logic
  Technique: Classification
  Computation cost: High
  High dimensional data: -
  Advantages: permits a data point to be in more than one cluster; has a more natural representation of the behavior of genes; effective, especially against port scans and probes
  Disadvantages: need to determine membership cutoff value; clusters are sensitive to initial assignment of centroids
GA
  Technique: Classification
  Computation cost: -
  High dimensional data: -
  Advantages: derives best classification rules; selects optimal parameters
  Disadvantages: can't assure constant optimization response times; over-fitting issue
K-Means
  Technique: Clustering
  Computation cost: -
  High dimensional data: -
  Advantages: simple to use
  Disadvantages: necessity of specifying k; sensitive to noise; clusters are sensitive to initial assignment of centroids
9. Decision Tree Classifier
Algorithm: Decision tree
Split(node, {examples}):
1. A ← the best attribute for splitting the {examples}
2. Decision attribute for this node ← A
3. For each value of A, create a new child node
4. Split the training {examples} among the child nodes
5. For each child node/subset:
   If the subset is pure: STOP
   Else: Split(child node, {subset})
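The recursive procedure above can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the experiment: it assumes categorical attributes, uses entropy to choose the "best attribute", and falls back to a majority vote when no attributes remain. The connection records and the attribute names `protocol` and `flag` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def remainder(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=remainder)

def split(rows, labels, attributes):
    """Return a leaf label, or an (attribute, {value: subtree}) node."""
    if len(set(labels)) == 1:                 # subset is pure: STOP
        return labels[0]
    if not attributes:                        # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    remaining = [x for x in attributes if x != a]
    children = {}
    for value in sorted({row[a] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[a] == value]
        children[value] = split([r for r, _ in pairs], [l for _, l in pairs], remaining)
    return (a, children)

def predict(tree, row):
    """Walk from the root to a leaf; assumes every attribute value was seen in training."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[row[attr]]
    return tree

# Hypothetical toy connection records, for illustration only.
rows = [
    {"protocol": "tcp", "flag": "SF"}, {"protocol": "tcp", "flag": "S0"},
    {"protocol": "udp", "flag": "SF"}, {"protocol": "icmp", "flag": "SF"},
    {"protocol": "tcp", "flag": "S0"}, {"protocol": "udp", "flag": "S0"},
]
labels = ["normal", "anomaly", "normal", "normal", "anomaly", "anomaly"]

tree = split(rows, labels, ["protocol", "flag"])
print(tree)
print(predict(tree, {"protocol": "udp", "flag": "S0"}))
```

Because each split is chosen greedily, the resulting tree is not guaranteed to be the smallest or best possible one.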
10. Entropy
For selecting the best attribute:
At each step, find the attribute that can be used to partition the dataset so as to minimise the entropy of the data.
A completely homogeneous sample has an entropy of 0.
An equally divided two-class sample has an entropy of 1.
The formula for entropy is:
$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$, where $p_i$ is the proportion of elements of class $i$ in $S$.
For a sample of positive and negative elements:
$\mathrm{Entropy}(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$
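As a quick check of the two boundary cases stated above, entropy can be computed directly from a list of class labels. This is a small illustrative sketch; the function name and the "+"/"-" labels are assumptions made for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

homogeneous = entropy(["+", "+", "+", "+"])   # all elements in one class
balanced = entropy(["+", "+", "-", "-"])      # two classes, equally divided

print(homogeneous)
print(balanced)
```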
11. Decision Tree – Sample Dataset
Years Experience  Employed?  Previous employers  Level of Education  Top-tier school  Interned  Hired
10                Y          4                   BS                  N                N         Y
0                 N          0                   BS                  Y                Y         Y
7                 N          6                   BS                  N                N         N
2                 Y          1                   MS                  Y                N         Y
20                N          2                   PhD                 Y                N         N
0                 N          0                   PhD                 Y                Y         Y
5                 Y          2                   MS                  N                Y         Y
3                 N          1                   BS                  N                Y         Y
15                Y          5                   BS                  N                N         Y
0                 N          0                   BS                  N                N         N
1                 N          1                   PhD                 Y                N         N
4                 Y          1                   BS                  N                Y         Y
0                 N          0                   PhD                 Y                N         Y
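To connect this table with the entropy criterion, the sketch below computes the information gain (entropy of Hired before the split minus the weighted entropy after it) for each categorical column. The snake_case column identifiers are assumptions introduced for the code. The numeric columns are skipped because they would first need a threshold.

```python
from collections import Counter
from math import log2

COLUMNS = ["years_experience", "employed", "previous_employers",
           "education", "top_tier_school", "interned", "hired"]
ROWS = [
    (10, "Y", 4, "BS", "N", "N", "Y"), (0, "N", 0, "BS", "Y", "Y", "Y"),
    (7, "N", 6, "BS", "N", "N", "N"), (2, "Y", 1, "MS", "Y", "N", "Y"),
    (20, "N", 2, "PhD", "Y", "N", "N"), (0, "N", 0, "PhD", "Y", "Y", "Y"),
    (5, "Y", 2, "MS", "N", "Y", "Y"), (3, "N", 1, "BS", "N", "Y", "Y"),
    (15, "Y", 5, "BS", "N", "N", "Y"), (0, "N", 0, "BS", "N", "N", "N"),
    (1, "N", 1, "PhD", "Y", "N", "N"), (4, "Y", 1, "BS", "N", "Y", "Y"),
    (0, "N", 0, "PhD", "Y", "N", "Y"),
]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="hired"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    a, t = COLUMNS.index(attribute), COLUMNS.index(target)
    base = entropy([r[t] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[a], []).append(r[t])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

gains = {name: information_gain(ROWS, name)
         for name in ["employed", "education", "top_tier_school", "interned"]}
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
```

On these 13 rows, `employed` and `interned` tie for the highest gain (about 0.275), because every candidate with Y in either column was hired.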
13. Steps for creating and evaluating Model
1) Import data
2) Edit Metadata
3) Convert to Indicator Values
4) Select Columns in dataset
5) Feature selection
6) “Decision Tree” on separate partitions
7) Score Model by adding scored labels and scored probabilities
10) Evaluate model using Precision, Recall and False positive rate
11) Compare performance and conclude which model to use
14. Activity Diagram
For model creation and tuning (steps shown for two parallel branches; Feature Selection appears in one branch only):
Import Data: read training set
Convert to Indicator Values: replace Class column with indicator values
Select Columns in Dataset: remove difficulty level column along with other unnecessary columns
Feature Selection: select 15 most important features
Two-Class Decision Tree
Tune Model
For model testing:
Score Model
Evaluate Model: generate and compare scores
Generate table that summarises result
15. Results
Total Records = ~1.25 Lacs (125973)
Model Building = ~75K (60%)
Test Model = ~50K (40%)
Precision (Positive Predictive Value) = TP/(TP + FP)
Recall (True Positive Rate) = TP/(TP + FN)
False Positive Rate (FPR), fall-out, probability of false alarm = FP/Total Negative
                 All Features                        Selected Features
Depth of tree    Precision  Recall    FPR            Precision  Recall    FPR
5                0.986469   0.986458  0.014073       0.969968   0.969788  0.029288
10               0.996714   0.996713  0.003258       0.98458    0.984557  0.01519
15               0.998297   0.998297  0.00173        0.98616    0.986121  0.01346
20               0.998258   0.998258  0.001764       0.986866   0.986814  0.012658
25               0.998258   0.998258  0.001764       0.98705    0.986992  0.012443
16. Future research roadmap
Work with other algorithms (Random Forest, SVM, K-Means, Logistic Regression) and observe whether an ensemble methodology can further enhance the model
Build real-time anomaly detection using the same approach and methodology
22. Results
Total Records = ~1.25 Lacs (125973)
Model Building = ~75K (60%)
Test Model = ~50K (40%)
Confusion matrix, All Features (rows: actual, columns: predicted)
Actual     Anomaly      Normal               Total
Anomaly    26887 (TP)   111 (FN) (Type II)   26998
Normal     554 (FP)     22957 (TN)           23511
Total      27441        23068                50509
Accuracy = (TP+TN)/Total = 49844/50509 = 0.9868
Precision = TP/(TP + FP) = 26887/(26887+554) = 0.9798
Confusion matrix, Selected Features (rows: actual, columns: predicted)
Actual     Anomaly      Normal               Total
Anomaly    26366 (TP)   632 (FN) (Type II)   26998
Normal     698 (FP)     22813 (TN)           23511
Total      27064        23445                50509
Accuracy = 0.9736
Precision = 0.9742
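The metric definitions used in these results can be applied to the confusion-matrix counts with a few lines of Python. This is a minimal sketch; the function and variable names are assumptions, and the counts are copied from the two matrices on this slide.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive the evaluation metrics from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # positive predictive value
        "recall": tp / (tp + fn),               # true positive rate
        "false_positive_rate": fp / (fp + tn),  # fall-out
    }

all_features = classification_metrics(tp=26887, fn=111, fp=554, tn=22957)
selected_features = classification_metrics(tp=26366, fn=632, fp=698, tn=22813)

for name, metrics in [("all features", all_features),
                      ("selected features", selected_features)]:
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```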
23. Results
Description                                        Precision  Recall   Area Under ROC
1. Decision Tree, Full Data Accuracy               0.9951     0.9764   98.62%
2. Decision Tree, Selected Feature Data Accuracy   0.9730     0.9703   97.35%
• Precision (Positive Predictive Value): PPV = TP/(TP + FP)
• Recall (True Positive Rate): TPR = TP/(TP + FN)
• Area under ROC: the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity).
25. Decision Trees
Supervised technique
Entropy
A measure of a dataset's disorder: how similar or different its elements are
If we classify the dataset into N different classes:
0 = all elements belong to the same class
1 = the classes are evenly mixed (two-class case)
At each step, find the attribute that can be used to partition the dataset so as to minimise the entropy of the data.
A completely homogeneous sample has an entropy of 0.
An equally divided two-class sample has an entropy of 1.
The formula for entropy is:
$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$, where $p_i$ is the proportion of elements of class $i$ in $S$.
For a sample of positive and negative elements:
$\mathrm{Entropy}(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$
A greedy algorithm is used
Demo: Refer Excel
26. Support Vector Machines
Supervised technique
Works well for classifying higher-dimensional data
Finds higher-dimensional support vectors across which to divide the data
Kernels can be used to represent data in higher-dimensional spaces to find hyperplanes that might not be apparent in lower dimensions
Kernel types:
Linear
Polynomial (curves)
RBF
A kernel function takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable one.
Useful in non-linear separation problems. Simply put, it performs complex data transformations, then works out how to separate the data based on the labels or outputs you have defined.
Computationally expensive
Plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.
Perform classification by finding the hyperplane that differentiates the two classes
Use a train/test split to decide on the model
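The idea of mapping data into a higher-dimensional space can be shown with a small Python sketch. It is not an SVM: no training takes place, and the separating hyperplane is picked by hand. It only illustrates how an explicit feature map (here x → (x, x²), an assumption chosen for the example) can make a one-dimensional problem linearly separable. The data points are hypothetical.

```python
def feature_map(x):
    """Map a 1-D input into 2-D space: x -> (x, x**2)."""
    return (x, x * x)

# Hypothetical 1-D data: the anomalous class lies on both sides of the normal
# class, so no single threshold on x separates the two classes.
points = [-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1]
labels = ["anomaly", "anomaly", "normal", "normal", "normal", "anomaly", "anomaly"]

def classify(x, threshold=2.0):
    """Linear rule in the mapped space: the hand-picked hyperplane x2 = threshold."""
    _, x2 = feature_map(x)
    return "anomaly" if x2 > threshold else "normal"

predictions = [classify(x) for x in points]
print(predictions == labels)
```

A kernel-based SVM obtains the same effect without computing the mapped coordinates explicitly, and it learns the hyperplane from the training data rather than taking a fixed threshold.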
27. Support Vector Machines
ADV:
Works well when clear separation exists
It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Works well for high dimensional data
DISADV:
It doesn't perform well when we have a large data set, because the required training time is higher
It doesn't perform well when the data set has more noise, i.e. target classes are overlapping
SVM doesn't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
Noise may create issues
28. Naïve Bayes
Classification technique based on Bayes' Theorem
Bayes' Theorem:
P(A|B) = P(A) P(B|A) / P(B)
Computationally efficient compared to decision trees
A Naïve Bayesian network can be represented using a DAG:
Each node represents an attribute
Each link represents the influence of one node on another
Calculate the probabilities, combine them, and predict according to a threshold.
Demo: Spam Classifier
P(spam|free) = P(spam) P(free|spam) / P(free)
i.e. the probability of a message being spam and containing the word ‘free’, divided by the overall probability of a message containing the word ‘free’
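The spam example can be worked through numerically. The message counts below are hypothetical and serve only to make the formula concrete.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' Theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior_a * likelihood_b_given_a / evidence_b

# Hypothetical corpus counts, for illustration only.
total_messages = 1000
spam_messages = 400
spam_with_free = 240    # spam messages containing the word 'free'
ham_with_free = 30      # non-spam messages containing the word 'free'

p_spam = spam_messages / total_messages
p_free_given_spam = spam_with_free / spam_messages
p_free = (spam_with_free + ham_with_free) / total_messages

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 4))
```

With these counts, 240 of the 270 messages containing 'free' are spam, so the posterior equals 240/270.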
29. Naïve Bayes
ADV:
Construction is easy and takes a short computation time
It can be applied to large datasets since it does not involve complicated parameters
Interpretation of knowledge representation
Encodes probabilistic relationships among the variables of interest
Ability to incorporate both prior knowledge and data
DISADV:
Harder to handle continuous features
May not contain any good classifiers if prior knowledge is wrong
30. K-Means Clustering
Iterative clustering technique based on splitting the data into K groups that are closest to K centroids
Unsupervised learning based on the position of each element
Can uncover interesting groupings
Randomly pick K centroids
Assign each data point to its closest centroid
Recompute the centroids based on the average position of their points
Iterate until points stop changing clusters
To predict the cluster for a new data point, just check which centroid it is closest to
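The iteration described above can be written as a short Python sketch. It is a minimal version that assumes squared Euclidean distance and picks the initial centroids from the data points themselves. The function names, the `seed` parameter and the sample points are hypothetical.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)             # randomly pick K centroids
    assignment = None
    for _ in range(iterations):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:          # points stopped changing: done
            break
        assignment = new_assignment
        for c in range(k):                        # recompute each centroid as the mean
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignment = kmeans(points, k=2)

print(assignment)
print(nearest((7.5, 8.1), centroids) == assignment[3])  # new point joins the second group
```

The result depends on the initial centroids, which is why practical implementations usually run several random restarts.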
31. K-Means Clustering
ADV:
Less complex
DISADV:
Choosing the right value of K
Labelling of clusters has to be done manually
Sensitive to noise
32. Nature of Input Data
Attributes may be binary, categorical or continuous.
Univariate/multivariate
Nature of attributes determines the applicability of anomaly
detection techniques
E.g., Statistical techniques to be used for continuous and categorical
data.
33. Data Labels
Based on the extent to which the labels are available, anomaly
detection techniques can operate in one of the following three
modes:
Supervised
Semi-Supervised
Unsupervised
35. Challenges
Defining a normal region
Anomalous observations appear like normal
Notion of an anomaly
Availability of labeled data
Noise
36. Key components
Anomaly Detection Technique, shaped by:
Research Areas: Machine Learning, Data Mining, Information Theory, Spectral Theory, ...
Problem Characteristics: Nature of Data, Labels, Anomaly Type, Output, ...
Application Domains: Intrusion Detection, Fraud Detection, ...
A baseline can be considered a description of the type of network behaviour that is accepted as normal; any deviation from this baseline is considered an anomaly.
Scores: assign an anomaly score to each instance in the test data, depending on the degree to which that instance is considered an anomaly.
The analyst may choose either to analyze the top few anomalies or to use a cut-off threshold to select the anomalies.
Labels: assign a label (normal or anomalous) to each test instance.
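The two ways of using anomaly scores mentioned above (top few anomalies versus a cut-off threshold) can be sketched as follows. The connection identifiers, the scores and the threshold value are hypothetical.

```python
def top_k(scores, k):
    """Identifiers of the k instances with the highest anomaly scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def label_by_threshold(scores, threshold):
    """Turn scores into labels using a cut-off threshold."""
    return {name: ("anomalous" if score >= threshold else "normal")
            for name, score in scores.items()}

# Hypothetical anomaly scores in [0, 1] for five test instances.
scores = {"conn-1": 0.05, "conn-2": 0.91, "conn-3": 0.12, "conn-4": 0.77, "conn-5": 0.30}

ranked = top_k(scores, 2)
labelled = label_by_threshold(scores, 0.5)
print(ranked)
print(labelled)
```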
Defining a normal region that encompasses every possible normal behavior is very difficult.
Attackers often adapt so that anomalous observations appear normal.
The exact notion of an anomaly differs between application domains.
Availability of labeled data for training and validating the models used by anomaly detection techniques is usually a major issue.
The data often contains noise that resembles the actual anomalies and is therefore difficult to distinguish and remove.