Finding interesting patterns in data can lead to uncovering new knowledge. New patterns that haven’t occurred before can signify events of interest. Depending on context, these can be called novelties, anomalies, outliers or events. Whatever they are called, they are interesting because they tell a story different from the norm. In this talk, we will call them anomalies. Two diverse applications of anomaly detection are detecting fraudulent credit card transactions and identifying astronomical anomalies such as solar flares.
However, there are many challenges in anomaly detection including high false positive rates and low predictive accuracy. Ensemble learning is a way of combining many algorithms or models to obtain better predictive performance. Anomaly detection is generally an unsupervised task, that is, we do not train models using labelled data. Constructing an unsupervised anomaly detection ensemble is challenging because we do not know the labels. In this talk we discuss two topics in anomaly detection. First, we introduce an anomaly detection ensemble using Item Response Theory (IRT) – a class of models used in educational psychometrics. Using IRT we construct an ensemble that can downplay noisy, non-discriminatory methods and accentuate sharper methods.
Then we explore anomaly detection in computer network security. With cyber incidents and data breaches becoming increasingly common, we have seen a massive increase in computer network attacks over the years. Anomaly detection methods, even though used to detect suspicious behaviour, are criticized for high false positive rates. In addition, computer networks produce a large amount of complex data. We go through the end-to-end process of detecting anomalies in this scenario and show how we can minimize false positives and visualise anomalies developing over time.
Getting better at detecting anomalies by using ensembles - CSIRO
Ensemble learning combines many algorithms or models to obtain better predictive performance. Ensembles have produced the winning algorithm in competitions such as the Netflix Prize. They are used in climate modelling and relied upon to make daily forecasts.
In this talk we will explore an anomaly detection ensemble. Anomaly detection is used in many practical applications, including detecting intrusions in computer networks. Anomaly detection is generally an unsupervised task, that is, we do not train models using labelled data. Constructing an unsupervised anomaly detection ensemble is challenging because we do not know the labels.
We use Item Response Theory (IRT) – a class of models used in educational psychometrics – to construct an unsupervised anomaly detection ensemble. IRT’s latent trait computation lends itself to anomaly detection because the latent trait can be used to uncover the hidden ground truth (labels). Using a novel IRT mapping to the anomaly detection problem, we construct an ensemble that can downplay noisy, non-discriminatory methods and accentuate sharper methods.
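The idea of downplaying noisy detectors and accentuating sharp ones can be illustrated with a simple consensus-weighting sketch. This is an invented stand-in for exposition, not the talk's actual IRT estimation: here a detector's weight is its rank correlation with the weighted consensus, a crude proxy for IRT discrimination.

```python
import numpy as np

def weighted_ensemble(scores, n_iter=10):
    """Combine per-detector anomaly scores (rows: detectors, cols: points).

    Illustrative stand-in for an IRT-style ensemble: detectors whose
    scores agree with the consensus get higher weight ("discrimination");
    noisy, non-discriminatory detectors are downweighted.
    """
    m, n = scores.shape
    # Rank-normalise each detector's scores to [0, 1] so scales are comparable
    ranked = np.argsort(np.argsort(scores, axis=1), axis=1) / (n - 1)
    w = np.ones(m) / m
    for _ in range(n_iter):
        consensus = w @ ranked                      # weighted consensus score
        # Weight = correlation of each detector with the consensus
        corr = np.array([np.corrcoef(r, consensus)[0, 1] for r in ranked])
        w = np.clip(corr, 0, None)
        w = w / w.sum()
    return w @ ranked, w
```

With three detectors tracking a common signal and one pure-noise detector, the noise detector's weight collapses and the anomaly keeps the top combined score.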
Why are anomalies important? Because they tell us a different story from the norm. An anomaly or an event might signify a failing heart rate of a patient, a fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies or anomalous events.
In this talk, we will give an introduction to anomaly detection. Anomalies are rare events. As a result, standard accuracy measures do not apply. But then, how do we evaluate an Anomaly Detection (AD) method? If we want to compare two or more AD methods, what kind of simple tests can we do? What are the data repositories that are available for AD?
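The point about standard accuracy measures is easy to demonstrate: with a 1% anomaly rate, a detector that flags nothing at all still scores 99% accuracy. A toy sketch:

```python
import numpy as np

# 1000 points, 10 true anomalies (1% anomaly rate)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A useless detector that flags nothing is 99% "accurate"
y_pred = np.zeros(1000, dtype=int)
accuracy = (y_true == y_pred).mean()
print(accuracy)   # 0.99 -- looks great, yet it detects nothing

# Recall exposes the failure
tp = ((y_pred == 1) & (y_true == 1)).sum()
recall = tp / (y_true == 1).sum()
print(recall)     # 0.0
```

This is why AD evaluation leans on precision, recall and ranking-based measures rather than raw accuracy.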
We will also discuss an ensemble method for AD. Constructing an AD ensemble is challenging because the class labels are not known. We will look at an unusual ally from psychometrics – Item Response Theory – to help us in this construction.
Searching for Anomalies, by Thomas Dietterich, Distinguished Professor Emeritus in the School of EECS at Oregon State University and Chief Scientist of BigML.
*MLSEV 2020: Virtual Conference.
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur... - Alex Pinto
We could all have predicted this with our magical Big Data analytics platforms, but it seems that Machine Learning is the new hotness in Information Security. A great number of startups with ‘cy’ and ‘threat’ in their names claim that their product will defend or detect more effectively than their neighbour's product "because math". And it seems easy to convince people without a PhD or two that the math just works.
Indeed, math is powerful, and large-scale machine learning is an important cornerstone of many of the systems that we use today. However, not all algorithms and techniques are born equal. Machine Learning is a most powerful toolbox, but not every tool can be applied to every problem, and that’s where the pitfalls lie.
This presentation will describe the different techniques available for data analysis and machine learning for information security, and discuss their strengths and caveats. The Ghost of Marketing Past will also show how similar the unfulfilled promises of deterministic and exploratory analysis were, and how to avoid making the same mistakes again.
Finally, the presentation will describe the techniques and feature sets the presenter developed over the past year as part of his ongoing research project on the subject. In particular, it will present some interesting results obtained since the last presentation at DEF CON 21, and some ideas that could improve the application of machine learning in information security, especially as a helper for security analysts in incident detection and response.
DutchMLSchool 2022 - History and Developments in ML - BigML, Inc
History and Present Developments in Machine Learning, by Tom Dietterich, Emeritus Professor of computer science at Oregon State University and Chief Scientist at BigML.
*Machine Learning School in The Netherlands 2022.
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez
This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top five things I learned while building algorithms to find them. You will see how various Gaussian-based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.
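One classic way Gaussian-based techniques fail is "masking": a single extreme spike inflates the mean and standard deviation so that smaller, real anomalies no longer cross the z-score threshold. A robust, non-parametric-style alternative based on the median and MAD avoids this. The data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, 1000)              # healthy metric samples
data = np.concatenate([normal, [200.0, 8.0]])    # one huge spike, one clear anomaly

# Gaussian (z-score) detection: the spike at 200 inflates mean/std,
# "masking" the smaller anomaly at 8
z = np.abs(data - data.mean()) / data.std()
print(z[-1] > 3)        # False -- the anomaly at 8 is missed

# Robust alternative: median and MAD are barely affected by the spike
med = np.median(data)
mad = np.median(np.abs(data - med))
robust_z = np.abs(data - med) / (1.4826 * mad)   # 1.4826: Gaussian consistency factor
print(robust_z[-1] > 3)  # True -- both anomalies are flagged
```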
Detecting and Improving Distorted Fingerprints using rectification techniques - sandipan paul
This presentation covers detecting and improving distorted fingerprints using rectification techniques combined with methods such as SVM and PCA.
In it, a distorted fingerprint is taken as input and rectified into a normal one.
Algorithm evaluation using item response theory - CSIRO
Item Response Theory (IRT) is a paradigm within the field of Educational Psychometrics that is used to assess student ability and test question difficulty and discrimination power. IRT has recently been applied to evaluate machine learning algorithm performance on a classification dataset. Here, we present a modified IRT-based framework for evaluating a portfolio of algorithms across a repository of datasets, while eliciting a suite of richer characteristics, such as stability, effectiveness and anomalousness, that describe different aspects of algorithm performance.
Smartphones as ubiquitous devices for behavior analysis and better lifestyle ... - University of Geneva
Final PhD defence, presented in March 2016 at the University of Padua, Italy: a three-year PhD under the supervision of Prof. Ombretta Gaggi. The work focused on how smartphones can be used to understand and analyse user behaviour, and how this information can be used to promote better lifestyles to individuals.
Machine Learning Essentials Demystified part1 | Big Data Demystified - Omid Vahdaty
Machine Learning Essentials Abstract:
Machine Learning (ML) is one of the hottest topics in the IT world today. But what is it really all about?
In this session we will talk about what ML actually is and in which cases it is useful.
We will talk about a few common algorithms for creating ML models and demonstrate their use with Python. We will also take a peek at Deep Learning (DL) and Artificial Neural Networks and explain how they work (without too much math) and demonstrate DL model with Python.
The target audience are developers, data engineers and DBAs that do not have prior experience with ML and want to know how it actually works.
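As a flavour of the kind of Python demo described, here is a minimal scikit-learn model fit on the bundled Iris toy dataset (this is an illustrative sketch, not material from the session itself):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small labelled dataset, split it, fit a model, and score it
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))   # typically > 0.9 on this toy dataset
```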
From Threat Intelligence to Defense Cleverness: A Data Science Approach (#tid... - Alex Pinto
This session will center on a market-centric and technological exploration of commercial and open-source threat intelligence feeds, which are increasingly offered as a way to improve the defense capabilities of organizations.
While not all Threat Intelligence can be represented as "indicator feeds", this space has enough market attention that it deserves a proper scientific, evidence-based investigation so that practitioners and decision makers can maximize the results they are able to get for the data they have available.
The presentation will consist of a data-driven analysis of a cross-section of threat intelligence feeds (both open-source and commercial) to measure their statistical bias, overlap, and representability of the unknown population of breaches worldwide. All the statistical code written and research data used (from the publicly available feeds) will be made available in the spirit of reproducible research. The tool itself will be able to be used by attendees to perform the same type of tests on their own data (called tiq-test).
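The kind of feed-overlap measurement described can be sketched in a few lines. The feeds below are tiny hypothetical sets of indicators (documentation-range IP addresses), not real feed data:

```python
# Hypothetical indicator feeds represented as sets of IP addresses
feed_a = {"203.0.113.5", "198.51.100.7", "192.0.2.1", "203.0.113.99"}
feed_b = {"203.0.113.5", "192.0.2.1", "198.51.100.200"}

# Overlap and Jaccard similarity between the two feeds
inter = feed_a & feed_b
union = feed_a | feed_b
jaccard = len(inter) / len(union)
print(f"overlap: {len(inter)} indicators, Jaccard = {jaccard:.2f}")
# -> overlap: 2 indicators, Jaccard = 0.40
```

Low pairwise Jaccard scores across many feeds are one signal that each feed covers only a biased slice of the underlying population of "bad stuff".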
Some of the important questions and answers that emerge in this presentation include:
"Are Threat Intelligence Feeds a statistical good measure of the population of 'bad stuff' happening out there? Is there even such a thing?"
"How tuned to YOUR specific threat surface are those feeds?"
"Can we actually make good use of them even if the threats they describe have no overlap with the actual incidents you have been seeing in your environment? (hint: probably not)"
We will provide an open-source tool for attendees to extract, normalize and export data from threat intelligence feeds to use in their internal projects and systems. It will be pre-configured with current publicly available network feeds and easily extensible for private or commercial feeds (called combine).
Anomaly Detection and Automatic Labeling with Deep Learning - Adam Gibson
Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatically labeling (and the pitfalls of this), and discover how you can deploy these techniques in your organization.
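The core mechanism behind autoencoder-based auto-labeling is reconstruction error: the model learns the dominant structure of the data, and points it reconstructs poorly get the anomaly label. A minimal sketch of that idea, using a linear PCA projection as a stand-in for the encoder/decoder and synthetic "location" data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic location traces: most points lie near a 1-D path in 2-D
t = rng.uniform(0, 1, 495)
X = np.column_stack([t, 2 * t]) + rng.normal(0, 0.05, (495, 2))
outliers = np.array([[3., -3.], [-3., 3.], [2., -2.], [-2., 2.], [3., 0.]])
X = np.vstack([outliers, X])          # 5 clearly off-path points first

# Linear "encoder/decoder": project onto the top principal component
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
recon = np.outer(Xc @ Vt[0], Vt[0])   # encode to 1-D, decode back to 2-D
err = np.linalg.norm(Xc - recon, axis=1)

# Auto-label: points the model reconstructs poorly get the anomaly label
labels = err > np.quantile(err, 0.98)
```

A variational autoencoder replaces the linear projection with a learned nonlinear latent space, but the labeling step (thresholding reconstruction error) is the same.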
Date: Monday, 3 January 2022
Lecture no. 143 of the #تواصل_تطوير initiative
Speaker: Eng. Mohamed El-Rafei Tarabay, Head of the Programmers' Syndicate in Dakahlia
Title: "IT INDUSTRY" - How To Get Into IT With Zero Experience
Monday, 3 January 2022, 7 pm Cairo time (8 pm Makkah time)
Attendance is via Zoom:
https://us02web.zoom.us/meeting/register/tZUpf-GsrD4jH9N9AxO39J013c1D4bqJNTcu
Note that the lecture will also be streamed live on the Egyptian Engineers Association channels. We hope to offer something of benefit to engineers and the engineering profession in the Arab world.
To contact the initiative's organisers, use the Telegram channel:
https://t.me/EEAKSA
Follow the initiative and the live stream on our various channels:
LinkedIn and e-library:
https://www.linkedin.com/company/eeaksa-egyptian-engineers-association/
Twitter:
https://twitter.com/eeaksa
Facebook:
https://www.facebook.com/EEAKSA
YouTube:
https://www.youtube.com/user/EEAchannal
General lecture registration:
https://forms.gle/vVmw7L187tiATRPw9
Note: free attendance certificates are available to those who fill in the evaluation form linked at the end of the lecture.
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation - Impetus Technologies
Detecting anomalous patterns in data can lead to significant actionable insights in a wide variety of application domains, such as fraud detection, network traffic management, predictive healthcare, energy monitoring and many more.
However, detecting anomalies accurately can be difficult. What qualifies as an anomaly is continuously changing and anomalous patterns are unexpected. An effective anomaly detection system needs to continuously self-learn without relying on pre-programmed thresholds.
Join our speakers Ravishankar Rao Vallabhajosyula, Senior Data Scientist, Impetus Technologies and Saurabh Dutta, Technical Product Manager - StreamAnalytix, in a discussion on:
Importance of anomaly detection in enterprise data, types of anomalies, and challenges
Prominent real-time application areas
Approaches, techniques and algorithms for anomaly detection
Sample use-case implementation on the StreamAnalytix platform
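The "continuously self-learn without pre-programmed thresholds" requirement can be sketched with an exponentially weighted baseline whose cutoff adapts as the metric's normal level drifts. This is an illustrative sketch only, not the StreamAnalytix implementation:

```python
import numpy as np

def adaptive_flags(series, alpha=0.05, k=4.0):
    """Flag points that deviate from an exponentially weighted baseline.

    The mean/variance estimates update with every sample, so the cutoff
    adapts as the metric's normal level drifts -- no fixed threshold.
    """
    mean, var = series[0], 1.0
    flags = []
    for x in series:
        dev = abs(x - mean)
        flags.append(dev > k * np.sqrt(var))
        # Self-learning step: update the baseline after scoring
        mean = (1 - alpha) * mean + alpha * x
        var = (1 - alpha) * var + alpha * dev**2
    return np.array(flags)
```

On a metric that drifts slowly from 0 to 5, no points are flagged; a sudden spike to 50 is, even though no absolute cutoff was ever configured.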
Seminar Presentation | Network Intrusion Detection using Supervised Machine L... - Jowin John Chemban
By:
Jowin John Chemban (jowinchemban@gmail.com)
HGW16CS022 (2016-2020 Batch)
S7 B.Tech Computer Science Engineering
Holy Grace Academy of Engineering, Mala
Date : September 2019
The IT industry is in the middle of one of its regular swings between centralisation and decentralisation, driven by increased automation of the network fabric itself, as well as new use cases such as IoT. With more and more processing and autonomy devolved to the edges, old assumptions about how to manage and operate a network have to change. It no longer makes sense to try to forward all the edge alerts to a central location for analysis, but central visibility on the health of the network is more important than ever.
New techniques in AI-enabled observability hold the promise of helping NOC teams deliver better experiences for users of their networks, without requiring excessive manual work or a heavy impact on Operations teams’ personal lives. Machine learning enables knowledge to be captured and made available in context when it is most useful, accelerating incident resolution and improving user experiences.
Malicious software is categorized into families based on static and dynamic characteristics, infection methods, and the nature of the threat. Visual exploration of malware instances and families in a low-dimensional space helps give a first overview of the dependencies and relationships among these instances, detecting their groups and isolating outliers. Furthermore, visual exploration of different sets of features is useful in assessing the quality of these sets as a valid abstract representation, which can later be used in classification and clustering algorithms to achieve high accuracy. We investigate one of the best dimensionality reduction techniques, known as t-SNE, to reduce the malware representation from a high-dimensional space consisting of thousands of features to a low-dimensional space. We experiment with different feature sets and depict malware clusters in 2-D.
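The t-SNE step described above can be sketched with scikit-learn. The feature matrix here is a synthetic stand-in for real malware features (three "families" with noisy binary indicators):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for a malware feature matrix: 3 "families", 50 samples each,
# 1000 binary features (e.g. API-call or n-gram indicators)
centers = rng.random((3, 1000)) < 0.1
X = np.vstack([
    np.logical_xor(c, rng.random((50, 1000)) < 0.02) for c in centers
]).astype(float)

# Reduce thousands of features to 2-D for visual exploration
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)   # (150, 2)
```

The 2-D embedding `emb` can then be scatter-plotted, with family labels as colours, to inspect clusters and outliers.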
Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a... - Chris Hammerschmidt
Machine Learning for DFIR with Velociraptor: From Setting Expectations to a Case Study
By Christian Hammerschmidt, PhD - Head of Engineering/ML, APTA Technologies
Machine learning (ML) or artificial intelligence (AI) often comes with great promise and large marketing budgets for cybersecurity, especially in monitoring (such as EDR/XDR solutions). Post-breach, it often turns out that the actual performance falls short of its promises.
In this talk, we’ll briefly look at ML for DFIR: What tasks can ML solve, generally speaking? What requirements do we have for a useful ML system in cybersecurity/DFIR contexts, such as reliability, robustness to attackers, and explainability? What makes ML difficult to apply in cybersecurity, e.g. when thinking about false alerts or attackers attempting to circumvent automated systems?
After discussing the basics, we look at ML for Velociraptor:
How can we process forensic data collected with VQL using machine learning (with a typical Python/Jupyter/scikit-learn/PyTorch stack)?
And how can we build artifacts that run ML directly on each endpoint, avoiding central data collection?
The talk concludes with a case study, showing how we significantly reduced time to analyze EVTX files in incident response cases, saving thousands of USD in costs and reducing time to resolution.
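The VQL-to-Python workflow mentioned above might look like this in a Jupyter notebook. The field names and values below are hypothetical, not a fixed Velociraptor schema; in practice the DataFrame would come from exported VQL results, e.g. via `pd.read_json(..., lines=True)`:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative stand-in for parsed VQL output: 100 events, one per row
df = pd.DataFrame({
    "EventID":    [4624] * 99 + [4672],
    "BytesSent":  [1_000 + 10 * i for i in range(99)] + [5_000_000],
    "DurationMs": [50] * 99 + [9_000],
})

# Unsupervised outlier scoring on the collected events
model = IsolationForest(contamination=0.01, random_state=0)
model.fit(df)
df["score"] = model.score_samples(df)   # lower = more anomalous
most_suspicious = df["score"].idxmin()
print(most_suspicious)   # 99 -- the one outlying event
```

Sorting by `score` gives an analyst a short, ranked triage list instead of the full event stream.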
Bio: Chris Hammerschmidt did his PhD research on machine learning methods for reverse engineering software systems. Now he heads APTA Technologies, a start-up building machine learning tools to understand software behavior.
Affiliation: APTA Technologies, https://apta.tech
How can we evaluate a portfolio of algorithms to extract meaningful interpretations about them? Suppose we have a set of algorithms. These can be classification, regression, clustering or any other type of algorithm. And suppose we have a set of problems that these algorithms can work on. We can evaluate these algorithms on the problems and get the results. From these results, can we explain the algorithms in a meaningful way? To find an answer to this question we turn to social sciences. Methodologies in social sciences focus on explanations as opposed to accurate predictions.
Item Response Theory (IRT) is a methodology in educational psychometrics that is used to design, analyse and score test questions and questionnaires. IRT can measure hidden qualities such as stress proneness, political inclinations, or verbal/mathematical ability. Participants take tests and IRT is used to determine the ability of participants and discrimination and difficulty of test questions. In this talk we use a novel mapping of the traditional IRT framework modified to the algorithm evaluation domain. Using this new mapping, we elicit a richer suite of characteristics including stability and anomalousness that describe important aspects of algorithm performance. We find the strengths and weaknesses of algorithms in the problem space. Using the algorithm strengths and weaknesses we construct a smaller portfolio of algorithms that gives good performance.
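The IRT quantities mentioned (difficulty, discrimination) come from the item characteristic curve. A minimal sketch of the standard 3-parameter logistic form (a generic textbook formula, not the talk's modified mapping):

```python
import numpy as np

def irt_prob(theta, a, b, c=0.0):
    """3-parameter logistic IRT item characteristic curve.

    P(correct | ability theta) for an item with discrimination a,
    difficulty b, and guessing parameter c.
    """
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# A discriminating item separates abilities sharply around its difficulty;
# a low-discrimination item barely does
sharp = irt_prob(np.array([-1.0, 1.0]), a=4.0, b=0.0)
noisy = irt_prob(np.array([-1.0, 1.0]), a=0.2, b=0.0)
print(sharp)   # approx [0.018, 0.982]
print(noisy)   # approx [0.450, 0.550]
```

In the algorithm-evaluation mapping, "items" become problems and "ability" becomes algorithm performance, so the same curve shapes describe which problems separate strong algorithms from weak ones.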
Explainable insights on algorithm performance - CSIRO
Machine Learning (ML) and Artificial Intelligence (AI) have made great strides in this decade. We have a plethora of ML algorithms that can be used to perform a given task, be it face recognition, image classification or natural language processing. However, explainability of ML/AI algorithms remains a big problem. Explainable AI (XAI) is a branch of ML devoted to unravelling the black-box nature of AI so that we understand the reasons behind its decisions/output. However, there are concerns that XAI sometimes produces “tools for computer scientists to explain things to other computer scientists”, which defeats its purpose. To this end, a growing number of researchers have called for integration with the social sciences to make truly explainable and trustworthy AI, because philosophy and the social sciences have debated the meaning and function of an explanation for millennia and have deeper insights [1]. In this talk, we present such an integration [2].
Our problem domain is algorithm evaluation, which considers a portfolio of algorithms and its performance on a set of problems. For example, it can be a portfolio of regression algorithms. The goal is to understand meaningful, explainable insights about the algorithms from the performance results. As the social science linkage, we use Item Response Theory (IRT), a methodology from educational psychometrics. IRT is traditionally used to evaluate the difficulty and discrimination of test questions and the ability of students and has causal interpretations. Using IRT we obtain explainable insights about algorithms relating to their stable/consistent nature, the difficulty level of problems they can handle and their behaviour. In addition, we visualise the problem spectrum and find regions on the spectrum where algorithms exhibit strengths. The causal interpretations of IRT transfer to the algorithm evaluation domain as we gain a deeper understanding of algorithms.
References
1. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267, 1–38 (2019).
2. Kandanaarachchi, S. & Smith-Miles, K. Comprehensive Algorithm Portfolio Evaluation using Item Response Theory. Journal of Machine Learning Research 24, 1–52 (2023).
Algorithm evaluation using item response theoryCSIRO
Item Response Theory (IRT) is a paradigm within the field of Educational Psychometrics, that is used to assess student ability and test question difficulty and discrimination power. IRT has recently been applied to evaluate
machine learning algorithm performance on a classification dataset. Here, we present a modified IRT-based framework for evaluating a portfolio of algorithms across a repository of datasets, while eliciting a suite of richer characteristics such as stability, effectiveness and anomalousness, that describe different aspects of algorithm performance.
Smartphones as ubiquitous devices for behavior analysis and better lifestyle ...University of Geneva
Final PhD Defence presented in March 2016 at the University of Padua, Italy. 3 years PhD under the supervision of Prof. Ombretta Gaggi. Work focused on how it is possible to use smartphone to understand and analyse user behaviour, and how it is possible to use this information to further promote better lifestyle to individuals.
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
Machine Learning Essentials Abstract:
Machine Learning (ML) is one of the hottest topics in the IT world today. But what is it really all about?
In this session we will talk about what ML actually is and in which cases it is useful.
We will talk about a few common algorithms for creating ML models and demonstrate their use with Python. We will also take a peek at Deep Learning (DL) and Artificial Neural Networks and explain how they work (without too much math) and demonstrate DL model with Python.
The target audience are developers, data engineers and DBAs that do not have prior experience with ML and want to know how it actually works.
From Threat Intelligence to Defense Cleverness: A Data Science Approach (#tid...Alex Pinto
This session will center on a market-centric and technological exploration of commercial and open-source threat intelligence feeds that are becoming common to be offered as a way to improve the defense capabilities of organizations.
While not all Threat Intelligence can be represented as "indicator feeds", this space has enough market attention that it deserves a proper scientific, evidence-based investigation so that practitioners and decision makers can maximize the results they are able to get for the data they have available.
The presentation will consist of a data-driven analysis of a cross-section of threat intelligence feeds (both open-source and commercial) to measure their statistical bias, overlap, and representability of the unknown population of breaches worldwide. All the statistical code written and research data used (from the publicly available feeds) will be made available in the spirit of reproducible research. The tool itself will be able to be used by attendees to perform the same type of tests on their own data (called tiq-test).
Some of the important questions and answers that emerge in this presentation include:
"Are Threat Intelligence Feeds a statistical good measure of the population of 'bad stuff' happening out there? Is there even such a thing?"
"How tuned to YOUR specific threat surface are those feeds?"
"Can we actually make good use of them even if the threats they describe have no overlap with the actual incidents you have been seeing in your environment? (hint: probably not)"
We will provide an open-source tool for attendees to extract, normalize and export data from threat intelligence feeds to use in their internal projects and systems. It will be pre-configured with current publicly available network feeds and easily extensible for private or commercial feeds (called combine).
Anomaly Detection and Automatic Labeling with Deep LearningAdam Gibson
Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatically labeling (and the pitfalls of this), and discover how you can deploy these techniques in your organization.
الموعد الإثنين 03 يناير 2022
143
مبادرة
#تواصل_تطوير
المحاضرة ال 143 من المبادرة
المهندس / محمد الرافعي طرباي
نقيب المبرمجين بالدقهلية
بعنوان
"IT INDUSTRY"
How To Getting Into IT With Zero Experience
وذلك يوم الإثنين 03 يناير2022
السابعة مساء توقيت القاهرة
الثامنة مساء توقيت مكة المكرمة
و الحضور من تطبيق زووم
https://us02web.zoom.us/meeting/register/tZUpf-GsrD4jH9N9AxO39J013c1D4bqJNTcu
علما ان هناك بث مباشر للمحاضرة على القنوات الخاصة بجمعية المهندسين المصريين
ونأمل أن نوفق في تقديم ما ينفع المهندس ومهمة الهندسة في عالمنا العربي
والله الموفق
للتواصل مع إدارة المبادرة عبر قناة التليجرام
https://t.me/EEAKSA
ومتابعة المبادرة والبث المباشر عبر نوافذنا المختلفة
رابط اللينكدان والمكتبة الالكترونية
https://www.linkedin.com/company/eeaksa-egyptian-engineers-association/
رابط قناة التويتر
https://twitter.com/eeaksa
رابط قناة الفيسبوك
https://www.facebook.com/EEAKSA
رابط قناة اليوتيوب
https://www.youtube.com/user/EEAchannal
رابط التسجيل العام للمحاضرات
https://forms.gle/vVmw7L187tiATRPw9
ملحوظة : توجد شهادات حضور مجانية لمن يسجل فى رابط التقيم اخر المحاضرة
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationImpetus Technologies
Detecting anomalous patterns in data can lead to significant actionable insights in a wide variety of application domains, such as fraud detection, network traffic management, predictive healthcare, energy monitoring and many more.
However, detecting anomalies accurately can be difficult. What qualifies as an anomaly is continuously changing and anomalous patterns are unexpected. An effective anomaly detection system needs to continuously self-learn without relying on pre-programmed thresholds.
Join our speakers Ravishankar Rao Vallabhajosyula, Senior Data Scientist, Impetus Technologies and Saurabh Dutta, Technical Product Manager - StreamAnalytix, in a discussion on:
Importance of anomaly detection in enterprise data, types of anomalies, and challenges
Prominent real-time application areas
Approaches, techniques and algorithms for anomaly detection
Sample use-case implementation on the StreamAnalytix platform
Seminar Presentation | Network Intrusion Detection using Supervised Machine L...Jowin John Chemban
By:
Jowin John Chemban (jowinchemban@gmail.com)
HGW16CS022 (2016-2020 Batch)
S7 B.Tech Computer Science Engineering
Holy Grace Academy of Engineering, Mala
Date : September 2019
The IT industry is in the middle of one of its regular swings between centralisation and decentralisation, driven by increased automation of the network fabric itself, as well as new use cases such as IoT. With more and more processing and autonomy devolved to the edges, old assumptions about how to manage and operate a network have to change. It no longer makes sense to try to forward all the edge alerts to a central location for analysis, but central visibility on the health of the network is more important than ever.
New techniques in AI-enabled observability hold the promise of helping NOC teams deliver better experiences for users of their networks, without requiring excessive manual work or heavy impact on Operations teams’ personal lives. Machine learning enables knowledge to be captured and made available in context when it is most useful, accelerating incident resolution and improving user experiences.
Malicious software is categorized into families based on its static and dynamic characteristics, infection methods, and nature of threat. Visual exploration of malware instances and families in a low-dimensional space helps give a first overview of dependencies and relationships among these instances, detecting their groups and isolating outliers. Furthermore, visual exploration of different sets of features is useful in assessing the quality of these sets to carry a valid abstract representation, which can later be used in classification and clustering algorithms to achieve high accuracy. We investigate one of the best dimensionality reduction techniques, known as t-SNE, to reduce the malware representation from a high-dimensional space consisting of thousands of features to a low-dimensional space. We experiment with different feature sets and depict malware clusters in 2-D.
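A minimal sketch of this reduction with scikit-learn's t-SNE, using a random stand-in matrix rather than the actual malware feature sets:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for a malware feature matrix: 300 samples x 1000 static/dynamic features.
X = rng.normal(size=(300, 1000))

# Reduce thousands of features to 2-D for visual exploration of clusters/outliers.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(emb.shape)  # (300, 2)
```

The 2-D embedding can then be scatter-plotted, coloured by malware family, to inspect groups and isolated outliers.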
Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...Chris Hammerschmidt
Machine Learning for DFIR with Velociraptor: From Setting Expectations to a Case Study
By Christian Hammerschmidt, PhD - Head of Engineering/ML, APTA Technologies
Machine learning (ML) or artificial intelligence (AI) often comes with great promise and large marketing budgets for cybersecurity, especially in monitoring (such as EDR/XDR solutions). Post-breach, it often turns out that the actual performance falls short of its promises.
In this talk, we’ll briefly look at ML for DFIR: What tasks can ML solve, generally speaking? What requirements do we have for a useful ML system in cybersecurity/DFIR contexts, such as reliability, robustness to attackers, and explainability? What makes ML difficult to apply in cybersecurity, e.g. when thinking about false alerts or attackers attempting to circumvent automated systems?
After discussing the basics, we look at ML for velociraptor:
How can we process forensic data collected with VQL using machine learning (with a typical Python/Jupyter/scikit-learn/PyTorch stack)?
And how can we build artifacts that run ML directly on each endpoint, avoiding central data collection?
The talk concludes with a case study, showing how we significantly reduced time to analyze EVTX files in incident response cases, saving thousands of USD in costs and reducing time to resolution.
Bio: Chris Hammerschmidt did his PhD research on machine learning methods for reverse engineering software systems. Now, he’s heading APTA Technologies, a start-up building machine learning tools to understand software behavior.
Affiliation: APTA Technologies, https://apta.tech
How can we evaluate a portfolio of algorithms to extract meaningful interpretations about them? Suppose we have a set of algorithms. These can be classification, regression, clustering or any other type of algorithm. And suppose we have a set of problems that these algorithms can work on. We can evaluate these algorithms on the problems and get the results. From these results, can we explain the algorithms in a meaningful way? To find an answer to this question we turn to social sciences. Methodologies in social sciences focus on explanations as opposed to accurate predictions.
Item Response Theory (IRT) is a methodology in educational psychometrics that is used to design, analyse and score test questions and questionnaires. IRT can measure hidden qualities such as stress proneness, political inclinations, or verbal/mathematical ability. Participants take tests and IRT is used to determine the ability of participants and discrimination and difficulty of test questions. In this talk we use a novel mapping of the traditional IRT framework modified to the algorithm evaluation domain. Using this new mapping, we elicit a richer suite of characteristics including stability and anomalousness that describe important aspects of algorithm performance. We find the strengths and weaknesses of algorithms in the problem space. Using the algorithm strengths and weaknesses we construct a smaller portfolio of algorithms that gives good performance.
Explainable insights on algorithm performanceCSIRO
Machine Learning (ML) and Artificial Intelligence (AI) have made great strides in this decade. We have a plethora of ML algorithms that can be used to perform a given task, be it face recognition, image classification or natural language processing. However, explainability of ML/AI algorithms remains a big problem. Explainable AI (XAI) is a branch of ML that is devoted to unravelling the black-box nature of AI so that we understand the reasons behind the decisions/output. However, there are concerns that XAI sometimes produces “tools for computer scientists to explain things to other computer scientists”, which defeats its purpose. To this end, a growing number of researchers have called for integration with social sciences to make truly explainable and trustworthy AI, because philosophy and social sciences have debated the meaning and function of an explanation for millennia and have deeper insights [1]. In this talk, we present such an integration [2].
Our problem domain is algorithm evaluation, which considers a portfolio of algorithms and its performance on a set of problems. For example, it can be a portfolio of regression algorithms. The goal is to understand meaningful, explainable insights about the algorithms from the performance results. As the social science linkage, we use Item Response Theory (IRT), a methodology from educational psychometrics. IRT is traditionally used to evaluate the difficulty and discrimination of test questions and the ability of students and has causal interpretations. Using IRT we obtain explainable insights about algorithms relating to their stable/consistent nature, the difficulty level of problems they can handle and their behaviour. In addition, we visualise the problem spectrum and find regions on the spectrum where algorithms exhibit strengths. The causal interpretations of IRT transfer to the algorithm evaluation domain as we gain a deeper understanding of algorithms.
References
1. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif Intell 267, 1–38 (2019).
2. Kandanaarachchi, S. & Smith-Miles, K. Comprehensive Algorithm Portfolio Evaluation using Item Response Theory. Journal of Machine Learning Research 24, 1–52 (2023).
Sophisticated tools for spatio-temporal data explorationCSIRO
Abstract: Spatio-temporal data underpin many critical processes such as weather, crop production, wildfire spread and epidemiological and disease function. Models of these processes can reveal changing characteristics in both space and time and can help inform decision-makers. A recent example: during the pandemic years, spatio-temporal models were used to inform public policy. While there are many spatio-temporal modelling methods and packages, tools specifically designed for exploratory data analysis are somewhat lacking. Exploratory data analysis is a vital step in the end-to-end process of statistical and machine learning modelling. A lack of tools for exploratory spatio-temporal data analysis may lead to researchers starting the modelling process prematurely and making suboptimal modelling choices. We aim to fill this gap by contributing stxplore – an R package equipped with useful functionality designed for spatio-temporal data exploration.
Explainable algorithm evaluation from lessons in educationCSIRO
How can we evaluate a portfolio of algorithms to extract meaningful interpretations about them? Suppose we have a set of algorithms. These can be classification, regression, clustering or any other type of algorithm. And suppose we have a set of problems that these algorithms can work on. We can evaluate these algorithms on the problems and get the results. From these results, can we explain the algorithms in a meaningful way? The easy option is to find which algorithm performs best for each problem and find the algorithm that performs best on the greatest number of problems. But, there is a limitation with this approach. We are only looking at the overall best! Suppose a certain algorithm gives the best performance on hard problems, but not on easy problems. We would miss this algorithm by using the “overall best” approach. How do we obtain a salient set of algorithm features?
A time series of networks. Is everything OK? Are there anomalies?CSIRO
Consider how bills get voted in the Parliament/Congress. Members belonging to different parties may vote differently. As time passes the voting patterns can change. These bill voting patterns can be denoted as a network. At each time stamp, a different network emerges. The collection of networks indexed by time is a time series -- of networks. We study these network time series. What are the features of these networks? How do the features change over time? Are there anomalous networks? We investigate these questions using real world networks. We use graph theoretic features to transform the network to a feature space and model their evolution using time series methods. Then we find anomalous networks using time series residuals. Our results coincide with noteworthy, historical events.
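A rough sketch of this pipeline in Python with networkx, using an injected dense graph as the anomaly; a constant-mean baseline stands in for the proper time series models used in the talk:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)

# A time series of networks: one random graph per time stamp,
# with an injected anomaly (a much denser graph) at t = 15.
graphs = [nx.gnp_random_graph(30, 0.1, seed=int(s)) for s in rng.integers(0, 10**6, 20)]
graphs[15] = nx.gnp_random_graph(30, 0.5, seed=7)

# Map each network into a feature space using graph-theoretic features.
feats = np.array([[nx.density(g), nx.average_clustering(g)] for g in graphs])

# Model the evolution (here crudely: a constant mean) and inspect residuals.
resid = feats - feats.mean(axis=0)
scores = np.abs(resid / feats.std(axis=0)).max(axis=1)
print(int(scores.argmax()))  # the anomalous time stamp
```

Replacing the constant mean with, say, an ARIMA fit per feature gives time series residuals in the spirit of the talk.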
Comparison of geostatistical methods for spatial dataCSIRO
With so many spatial/spatio-temporal modelling techniques to choose from, which model would you select to model the data you’re interested in? In this talk, we discuss a newcomer’s perspective of spatial modelling and compare four modelling techniques on Malaria prevalence in Kenya. The four models of interest are 1. Integrated Nested Laplace Approximations, commonly known as INLA, 2. Spatial Random Forests, an extension of random forests to the spatial domain, 3. GPBoost, a tree boosting technique with Gaussian Processes and 4. Fixed Rank Kriging. We will discuss the challenges associated with a comparison of such diverse models and share some results.
Algorithm evaluation using Item Response TheoryCSIRO
How do you evaluate a portfolio of algorithms? Suppose we have the results for a set of algorithms on a given set of problems. We can find which algorithm performs best for each problem and find the algorithm that performs best on the greatest number of problems. But, there is a limitation with this approach. We are only looking at the overall best! Suppose a certain algorithm gives the best performance on hard problems, but not on easy problems. We would miss this algorithm by using the “overall best” approach.
Item Response Theory (IRT) is used to design, analyse and score test questions/questionnaires that measure hidden qualities such as stress proneness, political inclinations, or verbal/mathematical ability. It is a methodology used in educational psychometrics. Participants take tests and IRT is used to determine the ability of participants and discrimination and difficulty of test questions. We use a novel mapping of the traditional IRT framework modified to the algorithm evaluation domain. Using this new mapping, we elicit a richer suite of characteristics including stability and anomalousness that describe important aspects of algorithm performance. We find the strengths and weaknesses of algorithms in the problem space. Using the algorithm strengths and weaknesses we construct a smaller portfolio of algorithms that gives good performance.
Evaluating algorithms using Item Response TheoryCSIRO
How do we evaluate a portfolio of algorithms? Suppose we have the results for a set of algorithms on a given set of problems. We can find which algorithm performs best for each problem and find the algorithm that performs best on the greatest number of problems. But, there is a limitation with this approach. We are only looking at the overall best! Suppose a certain algorithm gives the best performance on hard problems, but not on easy problems. We would miss this algorithm by using the “overall best” approach.
Item Response Theory (IRT) is used to design, analyse and score test questions and questionnaires that measure abilities and attitudes. It is a methodology used in psychometrics. Participants take tests and IRT is used to determine the ability of participants and discrimination and difficulty of test questions.
We use a novel mapping of the traditional IRT framework modified to the algorithm evaluation domain. Using this new mapping, we elicit a richer suite of characteristics including stability, anomalousness and effectiveness that describe important aspects of algorithm performance.
Why are anomalies important? Because they tell us a different story from the norm. An anomaly might signify a failing heart rate of a patient, a fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies.
There are many anomaly detection algorithms available. Most algorithms have parameters. Parameters are a tricky business because users need to set them. Sometimes it is not clear how to set these parameters. For example, there are anomaly detection algorithms that use kernel density estimates to detect anomalies. But they require the user to set the bandwidth. Setting the bandwidth for anomaly detection is different from setting the bandwidth for general kernel density estimation. Especially in high dimensions this is not an obvious task.
In this talk, we introduce lookout, a new approach that uses topological data analysis to select the bandwidth for anomaly detection. Using this bandwidth lookout uses leave-one-out kernel density estimates and extreme value theory to detect anomalies.
We also define the concept of anomaly persistence, which explores the birth and death of anomalies as the bandwidth changes. If a data point is identified as an anomaly for a large range of bandwidth values, then its significance as an anomaly increases.
The R package lookout implements this algorithm.
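A toy illustration of persistence, assuming a naive Gaussian leave-one-out KDE and an ad hoc bandwidth grid (lookout itself is an R package and chooses its bandwidth via topological data analysis instead):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0] = [6.0, 6.0]  # an injected anomaly far from the cloud

def loo_kde_scores(X, h):
    """Leave-one-out Gaussian KDE: low density => more anomalous."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * h * h))
    np.fill_diagonal(K, 0.0)            # leave self out
    return K.sum(1) / (len(X) - 1)

# Anomaly "persistence": how often a point is flagged across bandwidths.
bandwidths = np.linspace(0.2, 2.0, 10)
flagged = np.zeros(len(X))
for h in bandwidths:
    dens = loo_kde_scores(X, h)
    flagged += dens <= np.quantile(dens, 0.02)  # flag the lowest-density 2%
persistence = flagged / len(bandwidths)
print(persistence[0])  # the injected point is flagged at every bandwidth
```

A point flagged over a large range of bandwidths (persistence near 1) is a more significant anomaly than one flagged only at a single bandwidth.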
Why should we care about anomalies? They demand our attention because they are telling a different story from the norm. An anomaly might signify a failing heart rate of a patient, a fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies.
What are the challenges in anomaly detection? As with many machine/statistical learning tasks, high-dimensional data poses a problem. Another challenge is selecting appropriate parameters. Yet another challenge is high false positive rates.
In this talk we introduce two R packages – dobin and lookout – that address different challenges in anomaly detection. Dobin is a dimension reduction technique especially catered to anomaly detection. So, dobin is somewhat similar to PCA; but dobin puts anomalies in the forefront. We can use dobin as a pre-processing step and find anomalies using fewer dimensions.
On the other hand, lookout is an anomaly detection method that uses kernel density estimates and extreme value theory. But there is a difference. Generally, anomaly detection methods that use kernel density estimates require a user-defined bandwidth parameter. But does the user know how to specify this elusive bandwidth parameter? Lookout addresses this challenge by constructing an appropriate bandwidth for anomaly detection using topological data analysis, so the user doesn’t need to specify a bandwidth parameter. Furthermore, lookout has a low false positive rate because it uses extreme value theory.
We also introduce the concept of anomaly persistence, which explores the birth and death of anomalies as the bandwidth changes. If a data point is identified as an anomaly for a large range of bandwidth values, then its significance as an anomaly is high.
Anomalies are interesting because they tell a different story from the norm. Anomaly detection is used in many applications including detecting fraudulent credit card transactions and attacks in computer networks. But we do not want anomaly detection algorithms to be “alarm factories”, because if too many anomalies are detected on a regular basis, they tend to be ignored by the decision makers. Also, many anomaly detection methods have parameters that can only be set by experts, making them difficult to be used by lay people. Therefore, it is important to have “parameter-free” anomaly detection methods that minimize false positives.
In this talk, we introduce lookout, an anomaly detection method that uses extreme value theory and topological data analysis. Lookout is essentially parameter-free and has low false positive rates. We also delve into the world of computer networks and show how lookout can be used to detect suspicious nodes in computer network traffic.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
2. Overview
• Why is finding interesting patterns in data important?
• Methodology: Item Response Theory to construct an anomaly detection ensemble
• An application: computer networks
• Next challenges
4. Interesting patterns in data – Why?
• We live in a data-rich world
• Phones and personal smart devices
• Videos/CCTV
• Satellites roaming around the planet
• Social media and content generation
• Wearable technology (heart rate monitors)
5. What should we focus upon?
• Impossible to go through all the data in real time
• But we want to know when something “important” happens
• Important – context dependent
• A person who is monitored has had a fall (wearables)
• Deforestation (satellite data)
• A group harmful to society is gaining popularity (social media – national security)
• A bushfire starting off
6. Challenges
• Automated tools to extract these events of interest
• Early detection is super important
• High accuracy
• Low false positive rates
• Complex, noisy data
Goals
• To allocate resources effectively and efficiently
• Prevent disaster from happening
• Or minimize the loss
7. A critical piece – finding the interesting bit
• It can be called many names
• Events, anomalies, outliers, novelties, emerging threats
• Can’t always train a model to find the interesting bit
• Can’t lock in what is interesting
• Training a model on certain fraud/intrusions/cyber attacks is not optimal, because there are new types of fraud/attacks, always!
• Antivirus – known viruses only
• You want something more “intelligent” and accurate
• Alerts you when something weird happens with high accuracy
• Flexible (can evolve)
• A shift of focus over time
• Previously outliers were detected to be discarded – they make the model worse
• Now, we want to know about the anomalies – they are telling us something interesting
8. We looked at interesting patterns in data. Next, we look at some specific research.
9. An anomaly detection ensemble using Item Response Theory
Unsupervised Anomaly Detection Ensembles using Item Response Theory
Sevvandi Kandanaarachchi
Information Sciences (2022)
10. What are we trying to do?
Achieve higher accuracy:
• New methods with better accuracy
• Build an ensemble from existing methods
12. Specific challenges
• In regression we have (x, y) → (x, ŷ), so you can use e = y − ŷ in your ensemble
• The models can be weighted by their accuracy
But…
• Unsupervised anomaly detection does not have y
• We have x → each AD method gives y1, y2, y3, y4 → the ensemble gives y_ens
13. What is an anomaly detection ensemble?
The data x → unsupervised AD methods → anomaly scores y1, y2, y3, y4, y5, y6, y7 → AD ensemble → ensemble score
• The AD methods are heterogeneous methods
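The score matrix such an ensemble consumes can be sketched with a few heterogeneous detectors from scikit-learn; this is a hypothetical Python stand-in for the R methods in the talk, and a plain average replaces the IRT combination:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [[8.0, 8.0, 8.0]]])  # last row is anomalous

# Heterogeneous unsupervised AD methods -> one anomaly score per method.
# (score_samples is higher for normal points, so negate to get anomaly scores.)
scores = np.column_stack([
    -IsolationForest(random_state=0).fit(X).score_samples(X),
    -LocalOutlierFactor().fit(X).negative_outlier_factor_,
    -OneClassSVM(gamma="scale").fit(X).score_samples(X),
])

# Normalise each column to [0, 1]; this matrix Y (N x n) is what the
# IRT ensemble would take as input (here we just average as a placeholder).
Y = (scores - scores.min(0)) / (scores.max(0) - scores.min(0))
ens = Y.mean(axis=1)
print(int(ens.argmax()))  # index of the injected anomaly
```

The point of the IRT ensemble is precisely to replace this naive equal-weight average with discrimination-based weights.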
14. We use Item Response Theory to construct the ensemble
• Explain IRT
• How we use it to construct an AD ensemble
15. What is Item Response Theory (IRT)?
• A set of models used in educational psychometrics/social sciences
• Premise – an intrinsic “quality” that cannot be measured directly
• Racial prejudice or stress proneness
• Political inclinations
• Verbal or mathematical ability
• A test instrument
• A survey
• Exam
16. IRT
Survey responses / exam marks → IRT model → output:
• Discrimination of each test item
• Difficulty of each test item
• Participant ability (hidden quality)
17. IRT in education
• N students answer n questions
• Your input to the IRT model is a matrix of marks Y (N × n)
• Fit the IRT model
• You get as your output:
• Test item discrimination
• Test item difficulty
• Student ability (latent trait)
• Focus is on item discrimination and difficulty

        Q1    Q2    Q3    Q4
Stu 1   0.95  0.87  0.67  0.84
Stu 2   0.57  0.49  0.78  0.77
Stu n   0.75  0.86  0.57  0.45
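The roles of discrimination and difficulty can be illustrated with the standard two-parameter logistic (2PL) model; note the paper itself uses a continuous-response IRT variant, so this is only for intuition:

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response model: probability of a correct answer,
    given student ability theta, item discrimination a, item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = 0.0  # an average-ability student
print(round(p_correct(theta, a=2.0, b=0.0), 3))  # 0.5 on an item of matched difficulty
print(round(p_correct(theta, a=2.0, b=1.0), 3))  # 0.119 on a harder item
```

Higher discrimination a makes the curve steeper around the difficulty b, so the item separates students near that ability level more sharply.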
18. IRT in psychometrics
• A survey
• Rosenberg's Self-Esteem Scale
• “I feel I am a person of worth” (Strongly Agree/Agree/Neutral/...)
• Use original responses (no marking as in education)
• Fit the IRT model
• Output:
• Participants’ self-esteem (hidden quality = latent trait)
• Question discrimination
• Question difficulty
• Focus is on the hidden ability
19. IRT in Data Science/Machine Learning
• Relatively new area of research
• From performance data find
• Ability of classifiers
• Discrimination/difficulty of datasets
• 2019 – “Item response theory in AI: Analysing machine learning classifiers at the instance level” – F. Martínez-Plumed et al.
20. IRT ensemble for anomaly detection
Matrix of anomaly scores Y (N × n) → IRT model
• Latent trait = the anomalousness of the observations = the ensemble score
• High values → high anomalousness, low values → low anomalousness
21. Example
Dataset → unsupervised AD methods → AD ensemble → ensemble score
• AD methods (R packages: DDoutlier, h2o, e1071)
• Nearest-neighbourhood and density/distance based methods: KNN_AGG, LOF, COF, INFLO, KDEOS, LDF, LDOF
• Autoencoders – deep learning
• OCSVM – one-class support vector machine
• Isolation Forest – tree-based method
25. Why does it work?
• Ensemble scores:
θ_i = Σ_j α_j² (β_j + γ_j z_ij) / Σ_j α_j²
where
• θ_i – ensemble score for the i-th observation
• α_j – discrimination of the j-th AD method
• β_j – difficulty
• γ_j – scaling parameter for the j-th AD method
• z_ij – anomaly score of the j-th AD method on the i-th observation
26. Why does it work?
θ_i = Σ_j α_j² (β_j + γ_j z_ij) / Σ_j α_j² = Σ_j (c_j + w_j z_ij)
• Ensemble scores are a weighted average of the original anomaly scores
• Each AD method has a weight; the weights w_j depend on the discrimination and scaling parameters of each anomaly detection method
• AD methods with higher discrimination get higher weights
• The ensemble accentuates better methods and downplays noisy methods
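The ensemble score above can be written directly in code; a small Python sketch with made-up parameter values (in practice α, β, γ come from fitting the IRT model to the score matrix):

```python
import numpy as np

def irt_ensemble_scores(Z, alpha, beta, gamma):
    """theta_i = sum_j alpha_j^2 (beta_j + gamma_j z_ij) / sum_j alpha_j^2:
    a weighted average of the anomaly scores z_ij, where methods with
    higher discrimination alpha_j get higher weight."""
    w = alpha ** 2 / np.sum(alpha ** 2)
    return (w * (beta + gamma * Z)).sum(axis=1)

# Toy example: 3 observations, 2 AD methods; method 0 is far more discriminative.
Z = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.3]])
alpha = np.array([2.0, 0.5])
beta = np.zeros(2)
gamma = np.ones(2)
theta = irt_ensemble_scores(Z, alpha, beta, gamma)
print(theta.argmax())  # observation 2, driven mostly by the sharp method 0
```

Here method 0 carries weight α₀²/(α₀²+α₁²) ≈ 0.94, so the ensemble follows the sharper method and downplays the noisier one, exactly as the slide describes.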
27. This work
• R package outlierensembles – on CRAN
• Extends the R package EstCRM for IRT
• Includes other anomaly detection ensembles as well
• More details in the paper: https://arxiv.org/abs/2106.06243
28. We looked at an AD ensemble. Next, we dive into an application.
29. An application in computer network security
Honeyboost: Boosting honeypot performance with data fusion and anomaly detection
Sevvandi Kandanaarachchi, Hideya Ochiai (UTokyo), Asha Rao (RMIT)
Expert Systems with Applications (2022)
30. LAN Security Monitoring Project
• Between 12 ASEAN and SAARC countries
• Boost cyber-resilience among partners
• Countries in low economic conditions
• Cost-effective methods
• Focus on Local Area Networks (LAN)
• Monitoring nodes: about 10 in Japan, 6 in Thailand, 4 in Vietnam, 4 in Indonesia, 3 in Malaysia, 2 in Myanmar, 2 in Cambodia, 2 in India, 2 in the Philippines, 1 in Laos
[Figure: Average Monthly Malware Encounter Rate, 2018 (Microsoft, Security Intelligence Report, 2019)]
31. Inside a Local Area Network (LAN)
• Devices communicating with each other – smartphones, printers, smart appliances, a data server, and the LAN-Security Monitoring Device (honeypot)
• Any suspicious behaviour?
• Detect malware in action
32. The Data
• Several protocol features
• Features derived by looking at packet headers
• Features specific to the protocol
• Each protocol has a different number of features

Protocol 1
Timestamp, From_Node, F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11
1553585825, '172.16.1.107', 80, 2, 64, 0, 2, 0, 0, 0, 0, 1, 1
1553585890, '172.16.1.107', 80, 2, 64, 0, 2, 0, 0, 0, 0, 1, 1
1553660565, '172.16.1.107', 80, 1, 64, 0, 1, 0, 0, 0, 0, 1, 1
1553660570, '172.16.1.107', 80, 1, 64, 0, 1, 0, 0, 0, 0, 0, 0
1553667575, '172.16.1.107', 80, 3, 64, 0, 3, 0, 0, 0, 0, 2, 2
1553667580, '172.16.1.107', 80, 1, 64, 0, 1, 0, 0, 0, 0, 0, 0
1553751195, '172.16.1.208', 80, 1, 64, 0, 1, 0, 0, 0, 0, 0, 0

Protocol 2
Timestamp, From_Node, G1, G2, G3
1554351595, '172.16.1.86', 3702, 2, 652
1554351595, '172.16.1.86', 137, 2, 78
1554351595, '172.16.1.86', 1900, 4, 146
1554351595, '172.16.1.86', 7, 1, 28
33. Varying-dimensional time series
• Sort by time, then by node
• Different protocols have different features
• Finding anomalies from varying-dimensional time series
• 400 computers/nodes = 400 varying-dimensional time series
• Which ones are anomalous?
34. The methodology
Varying-dimensional time series for each node → multivariate time series → compute features → AD method (lookout)
• Using a window model
• We know the real anomalous nodes and the times (they access something they shouldn’t – the honeypot)
35. Varying-dimensional time series for each node → multivariate time series

Node A
Timestamp  Protocol  ARP count  ARP degree  TCP PC1  TCP PC2  UDP PC1  UDP PC2
30         ARP       10         12          0        0        0        0
55         TCP       0          0           -2.15    1.75     0        0
85         UDP       0          0           0        0        3.56     0.45
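The zero-filling that turns per-protocol records into one fixed-dimension multivariate series can be sketched with pandas, using the rows from the Node A table:

```python
import pandas as pd

# Per-protocol records with different feature sets (varying dimensions).
arp = pd.DataFrame({"Timestamp": [30], "ARP_count": [10], "ARP_degree": [12]})
tcp = pd.DataFrame({"Timestamp": [55], "TCP_PC1": [-2.15], "TCP_PC2": [1.75]})
udp = pd.DataFrame({"Timestamp": [85], "UDP_PC1": [3.56], "UDP_PC2": [0.45]})

# Concatenate and zero-fill the columns the other protocols lack,
# giving one fixed-dimension multivariate time series for the node.
ts = (pd.concat([arp, tcp, udp], ignore_index=True)
        .fillna(0.0)
        .sort_values("Timestamp"))
print(ts.shape)  # (3, 7)
```

Each row now lives in the same feature space regardless of which protocol generated it, so window-based features can be computed uniformly.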
37. Features
• The total length of line segments in ℝ⁶
• The maximum time difference
• Number of protocols used
• Number of TCP calls/UDP calls
• Total length of line segments in each protocol space
• Line of best fit in each protocol space
• Sum of squared errors for the line of best fit
(Figure: feature space projected onto TCP PC1 vs TCP PC2)
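A few of the listed features can be sketched directly from a window of observations; `window_features` below is a hypothetical helper, not the talk's implementation:

```python
import math

def window_features(points, timestamps, protocols):
    """Compute a handful of the window features listed above.

    points      - list of points in R^d, one per observation in the window
    timestamps  - corresponding observation times
    protocols   - protocol label per observation
    """
    # Total length of the line segments joining consecutive points
    seg_length = sum(
        math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)
    )
    # Maximum time difference between consecutive observations
    max_gap = max(
        (timestamps[i + 1] - timestamps[i] for i in range(len(timestamps) - 1)),
        default=0,
    )
    return {
        "segment_length": seg_length,
        "max_time_gap": max_gap,
        "n_protocols": len(set(protocols)),
        "n_tcp": protocols.count("TCP"),
        "n_udp": protocols.count("UDP"),
    }
```

Each window then collapses to a fixed-length feature vector, regardless of how many protocols the node used.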
38. AD method – lookout
• lookout – joint work with Rob Hyndman, published in JCGS (2021)
• Uses Extreme Value Theory (EVT) to find anomalies
• Applicability: computer network traffic has heavy tails, which EVT can handle
(Diagram: feature space → AD method (lookout))
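lookout itself is an R package; as a rough illustration of the EVT idea only (not the lookout algorithm), here is a peaks-over-threshold sketch that fits an exponential tail (a generalised Pareto with shape 0) to large anomaly scores and flags points whose tail probability is tiny:

```python
import math
import statistics

def pot_anomaly_pvalues(scores, tail_frac=0.1):
    """Peaks-over-threshold sketch: fit an exponential tail to the
    exceedances over a high threshold and return a tail probability
    for each exceeding point (smaller = more anomalous)."""
    s = sorted(scores)
    threshold = s[int(len(s) * (1 - tail_frac))]
    exceedances = [x - threshold for x in scores if x > threshold]
    scale = statistics.mean(exceedances)  # MLE for the exponential scale
    pvals = {}
    for i, x in enumerate(scores):
        if x > threshold:
            # P(exceedance > x - threshold) under the fitted tail
            pvals[i] = math.exp(-(x - threshold) / scale)
    return threshold, pvals
```

A heavy-tailed GPD fit (nonzero shape), as EVT proper would use, replaces the exponential here; the thresholding-and-tail-probability structure is the same.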
39. Results
• We identify real anomalies before they access the honeypot (which they shouldn’t do)
• The nodes behave anomalously before a “breach” is triggered
• We can predict a breach using this method
• Low false positives
• Visualise anomalies developing over time
• Discover patterns of suspicious behaviour
40. Thoughts ...
• This was a classic data science problem
• We were given the data and the problem context and asked to tackle it
• How do you formulate the problem?
• Many building blocks
• Identifying anomalies and visualising them aids decision making
• Bonus: opens up a research avenue
• The underlying research problem is general, not application-specific
41. Another way to think about this problem
• Model the network dynamics
• Find suspicious behaviour in a network
• Network dynamics not commonly used in cyber security
• Public datasets do not facilitate that
• Growth potential in this area
• Tom Bernardi’s MSc project
43. Next challenges for the field
• Networks
• Anomalies/events in networks (computer networks)
• Nuwan’s MSc Project on behavioural biometrics
• Visualization of networks at different granularities
• Dynamic networks – echo chambers – how they form
• Event detection in spatiotemporal data
• Applications in epidemiology
• Can you identify a hotspot before it happens?
• Ecology
• Algorithm bias
• Bias in data + bias in algorithm
44. Recap: ensembles to networks
• Broad applicability in detecting interesting patterns in data
• Applications in cyber security, wearable sensors, satellite data, social media
• The core research problem ties back to statistics/maths
• Need robust, highly accurate methodologies that can capture these patterns
• Exciting field. Thrilled to be part of it!
48. Continuous IRT model
• Samejima, 1969 – Continuous Response Model
• Wang and Zeng, 1998 – procedure to compute item parameters using expectation maximization for Samejima’s model
• Shojima, 2005 – non-iterative item parameter solution in each EM cycle
• Zopluoglu, 2015 – EstCRM R package implements Shojima’s 2005 model
• Update the log-likelihood to include negatively discriminating items
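For reference, one common parameterisation of Samejima’s Continuous Response Model (the form used in the EstCRM package; whether the talk uses exactly this form is an assumption) gives, for an observed score x on an item with maximum score k and respondent ability θ,

P(X ≥ x | θ) = Φ( a ( θ − b − (1/α) ln( x / (k − x) ) ) )

where a is the item discrimination, b the item difficulty, α a scaling parameter and Φ the standard normal CDF. Allowing a < 0 is what lets the ensemble downweight (or invert) noisy, non-discriminatory detection methods.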
50. Example with iterations
• Data in ℝ⁶ – first 2 dimensions shown, the others normally distributed
• Evaluation metric – area under the ROC curve (AUC)
(Figure: results at iteration 5 and iteration 10)
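The AUC used as the evaluation metric can be computed directly from anomaly scores and ground-truth labels via the rank (Mann-Whitney) formulation; a minimal sketch:

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    anomaly (label 1) outscores a randomly chosen normal point (label 0),
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every anomaly is scored above every normal point; 0.5 is no better than random ranking.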
53. LAN Security Monitoring
• ‘LAN-Security Monitoring Device’ to capture suspicious/malicious activities that happen inside a LAN
• LAN: Local Area Network
• Honeypot – a trap for attackers
(Diagram: smartphones, printer, smart appliances, data server and the monitoring device on the LAN)
55. Findings
• Suspicious nodes that do not access the honeypot
(Figure: lookout applied to the feature space for all nodes; two flagged nodes do not access the honeypot)