With statically defined thresholds as behavioural limits, Nagios offers only limited means of examining behaviour that changes over time for anomalies. Only the Holt-Winters forecasting algorithm, integrated into Nagios by Jake D. Brutlag, addresses this, and it reaches its limits in some cases. Since neither Nagios nor NagiosGrapher provides much beyond this for behavioural analysis, the idea arose to integrate a general interface (prototype) into NagiosGrapher, making it possible to try out and easily swap different algorithms for extrapolation and anomaly analysis.
The talk covers the design and operation of this interface and explains several methods of behavioural analysis. RRD data is used to compute baselines, deviations from them, and their extrapolation. More information: gnumaniacs.org
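As an illustration of the kind of swappable forecasting algorithm such an interface targets, here is a minimal sketch of additive Holt-Winters smoothing (the function name, parameters, and example series are our own, not NagiosGrapher code):

```python
def holt_winters_forecast(series, season_len, alpha=0.5, beta=0.1, gamma=0.1, n_ahead=4):
    """Additive Holt-Winters: maintain level, trend and seasonal components."""
    # initialise level from the first season, trend from the first two seasons
    level = sum(series[:season_len]) / season_len
    trend = (sum(series[season_len:2 * season_len]) - sum(series[:season_len])) / season_len ** 2
    seasonals = [series[i] - level for i in range(season_len)]
    for i, value in enumerate(series):
        last_level = level
        s = seasonals[i % season_len]
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonals[i % season_len] = gamma * (value - level) + (1 - gamma) * s
    # extrapolate: level plus h trend steps plus the matching seasonal offset
    return [level + (h + 1) * trend + seasonals[(len(series) + h) % season_len]
            for h in range(n_ahead)]

# perfectly periodic load curve: the forecast should continue the pattern
history = [10, 20, 30, 20] * 6
forecast = holt_winters_forecast(history, season_len=4)
```

In a monitoring context, the deviation between such a forecast and the observed RRD value is what gets checked against a confidence band instead of a static threshold.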
- The document discusses building a predictive anomaly detection model for network traffic using streaming data technologies.
- It proposes using Apache Kafka to ingest and process network packet and Netflow data in real-time, and Akka clustering to build predictive models that can guide human cybersecurity experts.
- The solution aims to more effectively guide human awareness of network threats by complementing localized rule-matching with predictive modeling of aggregate network behavior based on streaming metrics.
Sebastian Schelter – Distributed Machine Learning with the Samsara DSL – Flink Forward
The document discusses Samsara, a domain specific language for distributed machine learning. It provides an algebraic expression language for linear algebra operations and optimizes distributed computations. An example of linear regression on a cereals dataset is presented to demonstrate how Samsara can be used to estimate regression coefficients in a distributed fashion. Key steps include loading data as a distributed row matrix, extracting feature and target matrices, computing the normal equations, and solving the linear system to estimate coefficients.
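The same steps can be mirrored locally with NumPy; this is only a sketch of the normal-equation approach the talk describes, not Samsara's DSL (all names and the toy data are our own):

```python
import numpy as np

def estimate_coefficients(X, y):
    """Ordinary least squares via the normal equations X'X beta = X'y."""
    XtX = X.T @ X  # in Samsara this product is computed in a distributed fashion
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)

# toy stand-in for the cereals example: target is exactly 2*x1 + 3*x2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])
beta = estimate_coefficients(X, y)
```

The point of the distributed formulation is that `X.T @ X` is small (features × features) even when `X` has billions of rows, so only the aggregate leaves the cluster.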
The document discusses building a custom data source for Grafana using Vert.x to retrieve time series data stored in MongoDB. It describes how to connect Grafana and MongoDB through a three-tier architecture using a Vert.x microservice. The microservice would handle splitting requests, querying the database in parallel, aggregating results, and performing additional processing before returning data to Grafana. Vert.x is well-suited for this due to its asynchronous, reactive, and scalable nature. Sample code is available on GitHub to demonstrate the approach.
The document discusses machine learning techniques for graphs and graph-parallel computing. It describes how graphs can model real-world data with entities as vertices and relationships as edges. Common machine learning tasks on graphs include identifying influential entities, finding communities, modeling dependencies, and predicting user behavior. The document introduces the concept of graph-parallel programming models that allow algorithms to be expressed by having each vertex perform computations based on its local neighborhood. It presents examples of graph algorithms like PageRank, product recommendations, and identifying leaders that can be implemented in a graph-parallel manner. Finally, it discusses challenges of analyzing large real-world graphs and how systems like GraphLab address these challenges through techniques like vertex-cuts and asynchronous execution.
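The "each vertex computes from its local neighbourhood" idea is easy to show with PageRank; a minimal single-machine sketch of the vertex-centric update (names and the toy graph are ours, not GraphLab code):

```python
def pagerank(graph, damping=0.85, iters=50):
    """graph: dict vertex -> list of out-neighbours. Each vertex recomputes its
    rank from its in-neighbours only - the essence of graph-parallel models."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new_ranks = {}
        for v in graph:
            # sum the rank mass flowing in along incoming edges
            incoming = sum(ranks[u] / len(graph[u]) for u in graph if v in graph[u])
            new_ranks[v] = (1 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks

# tiny graph: c is pointed to by both a and b, so it should rank highest
g = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(g)
```

A graph-parallel engine runs the inner per-vertex update concurrently across partitions; the vertex-cut technique mentioned above decides how to split high-degree vertices across machines.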
Josh Patterson, Principal at Patterson Consulting: Introduction to Parallel Iterative Machine Learning Algorithms on Hadoop's Next-Generation YARN Framework
Introduction to Data Streaming – 05/12/2014 – Raja Chiky
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
Introduction to Deep Learning with Python – indico data
A presentation by Alec Radford, Head of Research at indico Data Solutions, on deep learning with Python's Theano library.
The emphasis of the presentation is high performance computing, natural language processing (using recurrent neural nets), and large scale learning with GPUs.
Video of the talk available here: https://www.youtube.com/watch?v=S75EdAcXHKk
ODSC 2019: Sessionisation via stochastic periods for root event identification – Kuldeep Jiwani
In today's world, the majority of information is generated by self-sustaining systems such as bots, crawlers, servers, and various online services. This information flows along the axis of time and is produced by these actors under some complex logic: for example, a stream of buy/sell order requests from an order gateway in finance, a stream of web requests from a monitoring or crawling service, or a hacker's bot sitting on the internet attacking computers. Although we may not be able to know the motive or intention behind these data sources, unsupervised techniques let us try to infer patterns or correlate events based on their repeated occurrences over time. Associating a chain of events in time order enables root event analysis. In certain cases, time-ordered correlation and root event identification are enough to automatically identify the signatures of malicious actors and take corrective action: stopping cyber attacks, malicious social campaigns, and so on.
Sessionisation is one such unsupervised technique: it tries to find the signal in a stream of timestamped events. In an ideal world this would reduce to finding the periods of a mixture of sinusoidal waves. In the real world it is far more complex, as even systematic, machine-generated events on the internet behave erratically. The notion of a signal's period therefore also changes: it can no longer be a single number, but must be treated as a random variable with an expected value and an associated variance. Hence we need to model "stochastic periods" and learn their probability distributions in an unsupervised manner.
The main focus of this talk is to showcase applied data science techniques for discovering stochastic periods. There are many ways to obtain periods from data, so the journey begins with a walkthrough of existing techniques like the FFT (Fast Fourier Transform) and then Gaussian Mixture Models. After highlighting the shortcomings of these techniques, we succinctly explain one of the most general non-parametric Bayesian approaches to the problem. Without going too deep into the math, we then return to applied data science and discuss a much simpler technique that solves the same problem when certain assumptions are satisfied.
In this talk we will demonstrate some time-based patterns we discovered while working on a security analytics use case that uses sessionisation, illustrated on an open-source malware attack dataset that is publicly available.
Key concepts explained in the talk: sessionisation, Bayesian machine learning techniques, Gaussian Mixture Models, kernel density estimation, FFT, stochastic periods, probabilistic modelling, Bayesian non-parametric methods
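Of the techniques walked through, the FFT baseline is the easiest to sketch: pick the strongest non-DC frequency bin and report its period (the function name and synthetic signal below are ours, for illustration only):

```python
import numpy as np

def dominant_period(signal, sample_rate=1.0):
    """Return the period of the strongest non-DC frequency component."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    peak = 1 + np.argmax(spectrum[1:])  # skip the DC bin at index 0
    return 1.0 / freqs[peak]

# synthetic event-rate signal repeating every 8 samples
t = np.arange(256)
signal = np.sin(2 * np.pi * t / 8)
period = dominant_period(signal)
```

This single-number answer is exactly what breaks down for real traffic; the stochastic-period treatment replaces it with a distribution over periods.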
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016 – MLconf
Alex Smola is the Manager of the Cloud Machine Learning Platform at Amazon. Prior to his role at Amazon, Smola was a Professor in the Machine Learning Department of Carnegie Mellon University and cofounder and CEO of Marianas Labs. Prior to that he worked at Google Strategic Technologies, Yahoo Research, and National ICT Australia. Prior to joining CMU, he was professor at UC Berkeley and the Australian National University. Alex obtained his PhD at TU Berlin in 1998. He has published over 200 papers and written or coauthored 5 books.
Abstract summary
Personalization and Scalable Deep Learning with MXNet: User return times and movie preferences are inherently time dependent. In this talk I will show how such time-dependent behaviour can be modelled efficiently using deep learning by employing an LSTM (Long Short-Term Memory) network. Moreover, I will show how to train large-scale distributed parallel models efficiently using MXNet. This includes a brief overview of the key components of defining networks and of optimization, and a walkthrough of the steps required to allocate machines and train a model.
Evaluating Classification Algorithms Applied To Data Streams – Esteban Donato
This document summarizes and evaluates several algorithms for classification of data streams: VFDTc, UFFT, and CVFDT. It describes their approaches for handling concept drift, detecting outliers and noise. The algorithms were tested on synthetic data streams generated with configurable attributes like drift frequency and noise percentage. Results show VFDTc and UFFT performed best in accuracy, while CVFDT and UFFT were fastest. The study aims to help choose algorithms suitable for different data stream characteristics like gradual vs sudden drift or frequent vs infrequent drift.
Distributed Decision Tree Learning for Mining Big Data Streams – Arinto Murdopo
This document presents a distributed decision tree learning algorithm called Vertical Hoeffding Tree (VHT) for mining big data streams. It summarizes the contributions of the master's thesis, which include: (1) Developing the SAMOA framework for distributed streaming machine learning, (2) Integrating SAMOA with the Storm distributed stream processing engine, and (3) Implementing the VHT algorithm to improve scalability over the standard Hoeffding Tree algorithm when dealing with high-dimensional data streams. The evaluation shows that VHT achieves similar accuracy to Hoeffding Tree but higher throughput, especially on datasets with many attributes.
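The Hoeffding tree family (including VHT) decides when to split on an attribute using the Hoeffding bound; the bound itself is a one-line computation, sketched here with example values of our own:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that the observed mean of n samples lies within epsilon of
    the true mean with probability 1 - delta (Hoeffding's inequality)."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# the tree splits once the gain gap between the two best attributes exceeds
# epsilon; more observed examples shrink epsilon and allow earlier decisions
eps_small_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=200)
eps_large_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=20_000)
```

Because the bound depends only on counts, the statistics can be partitioned vertically by attribute, which is what gives VHT its scalability on high-dimensional streams.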
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
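Of the synopsis techniques listed, random sampling over an unbounded stream is classically done with reservoir sampling; a minimal sketch (the function name is our own):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length,
    using O(k) memory regardless of how long the stream runs."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # item i+1 replaces a reservoir slot with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=100, rng=random.Random(42))
```

The same pattern, bounded state updated once per arriving element, underlies the histogram and sketch structures mentioned above.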
This document discusses randomized algorithms for solving regression problems on large datasets in parallel and distributed environments. It begins by motivating the need for methods that can perform "vector space analytics" at very large scales beyond what is possible with traditional graph and matrix algorithms. Randomized regression algorithms are introduced as an approach that is faster, simpler to implement, implicitly regularizes to avoid overfitting, and is inherently parallel. The document then outlines how randomized regression can be implemented in shared memory, message passing, MapReduce, and fully distributed environments.
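As one concrete member of this family (chosen by us for illustration, not necessarily the document's exact algorithm), sketch-and-solve least squares compresses the tall data matrix with a random projection before solving:

```python
import numpy as np

def sketched_least_squares(X, y, sketch_rows, seed=0):
    """Solve least squares on a random Gaussian sketch (S @ X, S @ y) instead
    of the full data - far fewer rows, and the sketch rows parallelize."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((sketch_rows, X.shape[0])) / np.sqrt(sketch_rows)
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta

# overdetermined noiseless system: the sketch recovers beta almost exactly
rng = np.random.default_rng(1)
X = rng.standard_normal((5_000, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta
beta = sketched_least_squares(X, y, sketch_rows=200)
```

The parallelism claim follows directly: each worker can apply its slice of `S` to its slice of the rows independently, and only the small sketched system is solved centrally.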
OpenML.org: Networked Science and IoT Data Streams by Jan van Rijn, Universit... – EuroIoTa
OpenML enables truly collaborative machine learning. Scientists can post important data, inviting anyone to help analyze it. OpenML structures and organizes all results online to show the state of the art and push progress.
OpenML is being integrated in most popular machine learning environments, so you can automatically upload all your data, code, and experiments. And if you develop new tools, there's an API for that, plus people to help you.
OpenML allows you to search, compare, visualize, analyze and download all combined results online. Explore the state of the art, improve it, build on it, ask questions and start discussions.
Lowering the bar: deep learning for side-channel analysis – Riscure
Deep learning can help automate the signal analysis process in power side-channel analysis. We show how typical signal processing problems such as noise reduction and re-alignment are solved automatically by the deep learning network, and how we can break a lightly protected AES, an AES implementation with masking countermeasures, and a protected ECC implementation. These experiments indicate that, where side-channel analysis previously depended heavily on human skill, deep learning automation is taking the first steps toward lowering the attacker skill such attacks require.
Learning visual representations without human labels – Kai-Wen Zhao
Self-supervised learning (SSL) is one of the fastest-growing research topics of recent years. SSL provides algorithms that learn visual representations directly from the data itself rather than from manual human labels. From a theoretical point of view, SSL explores information theory and the nature of large-scale datasets.
Scalable Data Science and Deep Learning with H2O
In this session, we introduce the H2O data science platform. We will explain its scalable in-memory architecture and design principles and focus on the implementation of distributed deep learning in H2O. Advanced features such as adaptive learning rates, various forms of regularization, automatic data transformations, checkpointing, grid-search, cross-validation and auto-tuning turn multi-layer neural networks of the past into powerful, easy-to-use predictive analytics tools accessible to everyone. We will present a broad range of use cases and live demos that include world-record deep learning models, anomaly detection tools and approaches for Kaggle data science competitions. We also demonstrate the applicability of H2O in enterprise environments for real-world customer production use cases.
By the end of the hands-on-session, attendees will have learned to perform end-to-end data science workflows with H2O using both the easy-to-use web interface and the flexible R interface. We will cover data ingest, basic feature engineering, feature selection, hyperparameter optimization with N-fold cross-validation, multi-model scoring and taking models into production. We will train supervised and unsupervised methods on realistic datasets. With best-of-breed machine learning algorithms such as elastic net, random forest, gradient boosting and deep learning, you will be able to create your own smart applications.
A local installation of RStudio is recommended for this session.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Deep Stream Dynamic Graph Analytics with Grapharis – Massimo Perini – Flink Forward
The world's toughest and most interesting analysis tasks lie at the intersection of graph data (inter-dependencies in the data) and deep learning (inter-dependencies in the model). Classical graph embedding techniques have for years occupied research groups seeking ways to encode complex graphs into a low-dimensional latent space. Recently, deep learning has come to dominate embedding generation thanks to its ability to automatically generate embeddings for any static graph.
Grapharis is a project that revisits graph embeddings in a realistic setting where graphs are not static but keep changing over time (think of user interactions in social networks). More specifically, we explored how a system like Flink can simplify both training a graph embedding model incrementally and making complex inferences and predictions in real time over graph-structured data streams. To our knowledge, Grapharis is the first complete data pipeline using Flink and TensorFlow for real-time deep graph learning. This talk covers how we can train, store, and generate embeddings continuously and accurately as the data evolves over time, without retraining the underlying model.
AWS re:Invent 2016: Using MXNet for Recommendation Modeling at Scale (MAC306) – Amazon Web Services
For many companies, recommendation systems solve important machine learning problems. But as recommendation systems grow to millions of users and millions of items, they pose significant challenges when deployed at scale. The user-item matrix can have trillions of entries (or more), most of which are zero. To make common ML techniques practical, sparse data requires special techniques. Learn how to use MXNet to build neural network models for recommendation systems that can scale efficiently to large sparse datasets.
This document provides an overview of object detection techniques including region-based and region-free methods. Region-based methods like R-CNN, Fast R-CNN, and Faster R-CNN first generate region proposals then extract features from those regions to classify and regress bounding boxes. Region-free methods like YOLO, YOLOv2, and SSD predict bounding boxes and classifications directly from the image in one pass. Both approaches are trained end-to-end using techniques like RoI pooling and anchor boxes to predict multiple detections. Recent work aims to improve speed and accuracy by generating detections sequentially or using soft NMS instead of hard thresholding.
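Both detector families end with some form of non-maximum suppression over IoU overlaps. A minimal greedy NMS sketch (the hard-thresholding variant, not the soft-NMS mentioned above; box format and names are our own):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep boxes in score order, drop any that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# two heavily overlapping detections plus one far away: one duplicate is dropped
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Soft NMS replaces the hard drop with a score decay proportional to the overlap, which is the refinement the recent work above refers to.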
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
TensorFlow in 3 sentences
Barbara Fusinska provides a high-level overview of TensorFlow in 3 sentences or less. She demonstrates how to build a computational graph for classification tasks using APIs like tf.nn and tf.layers. Barbara encourages attendees to get involved with open source TensorFlow communities on GitHub and through tools like Docker containers.
Distance oracle – fast queries for the distance between any two vertices in a graph – Hong Ong
A review of methods for quickly computing the distance between any two vertices in a graph, with applications in many fields such as telecom, internet routing, and social network analysis.
The document discusses computing analytics directly in databases to improve performance over traditional approaches that import all data into memory first. It presents three case studies - linear regression, correlation, and value-at-risk estimation - showing how each can leverage database operations like aggregation, sorting, and querying to perform calculations faster by computing near the data. Graphs of execution times demonstrate that partially computing models within the database rather than solely in memory leads to significant speed improvements, especially as data sizes increase.
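The linear-regression case study can be mimicked with SQLite: the database computes five aggregates, and only those sufficient statistics, not the rows, leave the database (table and column names below are invented for the sketch):

```python
import sqlite3

# in-memory table standing in for a large fact table near which we compute
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(float(x), 2.0 * x + 1.0) for x in range(100)])

# one aggregation pass inside the database yields the sufficient statistics
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM points"
).fetchone()

# closed-form simple linear regression y = intercept + slope * x
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
```

As the table grows, the Python side still receives exactly five numbers, which is the "compute near the data" speedup the case studies measure.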
[241] Large-scale search with polysemous codes – NAVER D2
This document discusses using polysemous codes to perform large-scale search over visual signatures. Polysemous codes allow product quantization codes to be interpreted as both compact binary codes for efficient Hamming distance search and codes that preserve distance information for accurate nearest neighbor search. The key ideas are to learn an index assignment that maps similar product quantization codes to binary codes with smaller Hamming distance, and to directly optimize this assignment to match the distances between codebook centroids. This allows using a single code representation for both fast Hamming search and precise distance search, without increasing memory requirements. The document provides examples of applying polysemous codes to build a large graph connecting images based on visual similarity.
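The fast filtering stage rests on Hamming distance between compact binary codes; a minimal sketch of radius search over integer-encoded codes (function names are ours, and production systems use vectorized popcount instructions rather than this loop):

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as Python ints:
    XOR the codes, then count the set bits."""
    return bin(a ^ b).count("1")

def hamming_search(query, codes, radius):
    """Return the indices of database codes within the given Hamming radius."""
    return [i for i, code in enumerate(codes) if hamming(query, code) <= radius]

# 8-bit toy database; only near-identical codes survive a radius-2 filter
database = [0b00000000, 0b00000001, 0b11110000, 0b11111111]
hits = hamming_search(0b00000011, database, radius=2)
```

The polysemous trick is that the surviving codes can then be re-read as product quantization codes for an accurate distance re-ranking, with no second representation in memory.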
Suggestions:
1) For best quality, download the PDF before viewing.
2) Open at least two windows: One for the Youtube video, one for the screencast (link below), and optionally one for the slides themselves.
3) The Youtube video is shown on the first page of the slide deck, for slides, just skip to page 2.
Screencast: http://youtu.be/VoL7JKJmr2I
Video recording: http://youtu.be/CJRvb8zxRdE (Thanks to Al Friedrich!)
In this talk, we take Deep Learning to task with real world data puzzles to solve.
Data:
- Higgs binary classification dataset (10M rows, 29 cols)
- MNIST 10-class dataset
- Weather categorical dataset
- eBay text classification dataset (8500 cols, 500k rows, 467 classes)
- ECG heartbeat anomaly detection
Albert Bifet – Apache SAMOA: Mining Big Data Streams with Apache Flink – Flink Forward
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries and an execution framework.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection and recommendations.
3) The Vertical Hoeffding Tree algorithm in SAMOA provides high parallelism and accuracy for streaming decision tree learning, outperforming native Apache Flink implementations on certain datasets while being faster on others.
The document provides an overview of the Network Simulator ns-2. It discusses:
1) The history and goals of ns-2 including supporting networking research, protocol design and comparison, and providing a collaborative environment.
2) Current projects using ns-2 including SAMAN to build robust networks and CONSER to extend ns-2's capabilities.
3) The components and functionality of ns-2 including modeling wired and wireless networks, various protocols, traffic sources, and queue management.
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen... – Thom Lane
Summary of the models and methods used for the DAWNBench CIFAR-10 challenge. Starting from the high-level ResNet architecture, we review basic vs bottleneck blocks, pre-activation blocks and Wide ResNets. After a brief mention of the PyramidNet, ResNeXt and DenseNet models, we look at regularization techniques such as Mixup, and finish with a review of cyclical learning rates and the phenomenon of "super convergence".
MXNet Gluon API was used for the implementations.
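Mixup itself is only a few lines; a framework-agnostic NumPy sketch (the deck used MXNet Gluon, and the names below are our own):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two training examples and their one-hot labels with a single
    Beta(alpha, alpha)-sampled weight, as in the Mixup augmentation."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# two toy "images" with opposite one-hot labels
x_a, y_a = np.ones((4, 4)), np.array([1.0, 0.0])
x_b, y_b = np.zeros((4, 4)), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, rng=np.random.default_rng(0))
```

Training on these convex combinations regularizes the network toward linear behaviour between examples, which is why it pairs well with the aggressive learning-rate schedules reviewed above.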
ODSC 2019: Sessionisation via stochastic periods for root event identificationKuldeep Jiwani
In todays world majority of information is generated by self sustaining systems like various kinds of bots, crawlers, servers, various online services, etc. This information is flowing on the axis of time and is generated by these actors under some complex logic. For example, a stream of buy/sell order requests by an Order Gateway in financial world, or a stream of web requests by a monitoring / crawling service in the web world, or may be a hacker's bot sitting on internet and attacking various computers. Although we may not be able to know the motive or intention behind these data sources. But via some unsupervised techniques we can try to infer the pattern or correlate the events based on their multiple occurrences on the axis of time. Associating a chain of events in order of time helps in doing a root event analysis. In certain cases a time ordered correlation and root event identification is good enough to automatically identify signatures of various malicious actors and take appropriate corrective actions to stop cyber attacks, stop malicious social campaigns, etc.
Sessionisation is one such unsupervised technique that tries to find the signal in a stream of events associated with a timestamp. In the ideal world it would resolve to finding periods with a mixture of sinusoidal waves. But for the real world this is a much complex activity, as even the systematic events generated by machines over the internet behave in a much erratic manner. So the notion of a period for a signal also changes in the real world. We can no longer associate it with a number, it has to be treated as a random variable, with expected values and associated variance. Hence we need to model "Stochastic periods" and learn their probability distributions in an unsupervised manner.
The main focus of this talk will be to showcase applied data science techniques to discover stochastic periods. There are many ways to obtain periods in data, so the journey would begin by a walk through of existing techniques like FFT (Fast Fourier Transform) then discuss about Gaussian Mixture Models. After highlighting the short comings of these techniques we will succinctly explain one of the most general non-parametric Bayesian approaches to solve this problem. Without going too deep in the complex math, we will get back to applied data science and discuss a much simpler technique that can solve the same problem if certain assumptions are satisfied.
In this talk we will demonstrate some time based pattern we discovered while working on a security analytics use case that uses Sessionisation. In the talk we will demonstrate such patterns based on an open source malware attack datasets that is available publicly.
Key concepts explained in talk: Sessionisation, Bayesian techniques of Machine Learning, Gaussian Mixture Models, Kernel density estimation, FFT, stochastic periods, probabilistic modelling, Bayesian non-parametric methods
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016MLconf
Alex Smola is the Manager of the Cloud Machine Learning Platform at Amazon. Prior to his role at Amazon, Smola was a Professor in the Machine Learning Department of Carnegie Mellon University and cofounder and CEO of Marianas Labs. Prior to that he worked at Google Strategic Technologies, Yahoo Research, and National ICT Australia. Prior to joining CMU, he was professor at UC Berkeley and the Australian National University. Alex obtained his PhD at TU Berlin in 1998. He has published over 200 papers and written or coauthored 5 books.
Abstract summary
Personalization and Scalable Deep Learning with MXNET: User return times and movie preferences are inherently time dependent. In this talk I will show how this can be accomplished efficiently using deep learning by employing an LSTM (Long Short Term Model). Moreover, I will show how to train large scale distributed parallel models using MXNet efficiently. This includes a brief overview of key components of defining networks, of optimization, and a walkthrough of the steps required to allocate machines, and to train a model.
Evaluating Classification Algorithms Applied To Data Streams Esteban DonatoEsteban Donato
This document summarizes and evaluates several algorithms for classification of data streams: VFDTc, UFFT, and CVFDT. It describes their approaches for handling concept drift, detecting outliers and noise. The algorithms were tested on synthetic data streams generated with configurable attributes like drift frequency and noise percentage. Results show VFDTc and UFFT performed best in accuracy, while CVFDT and UFFT were fastest. The study aims to help choose algorithms suitable for different data stream characteristics like gradual vs sudden drift or frequent vs infrequent drift.
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
This document presents a distributed decision tree learning algorithm called Vertical Hoeffding Tree (VHT) for mining big data streams. It summarizes the contributions of the master's thesis, which include: (1) Developing the SAMOA framework for distributed streaming machine learning, (2) Integrating SAMOA with the Storm distributed stream processing engine, and (3) Implementing the VHT algorithm to improve scalability over the standard Hoeffding Tree algorithm when dealing with high-dimensional data streams. The evaluation shows that VHT achieves similar accuracy to Hoeffding Tree but higher throughput, especially on datasets with many attributes.
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
This document discusses randomized algorithms for solving regression problems on large datasets in parallel and distributed environments. It begins by motivating the need for methods that can perform "vector space analytics" at very large scales beyond what is possible with traditional graph and matrix algorithms. Randomized regression algorithms are introduced as an approach that is faster, simpler to implement, implicitly regularizes to avoid overfitting, and is inherently parallel. The document then outlines how randomized regression can be implemented in shared memory, message passing, MapReduce, and fully distributed environments.
OpenML.org: Networked Science and IoT Data Streams by Jan van Rijn, Universit... – EuroIoTa
OpenML enables truly collaborative machine learning. Scientists can post important data, inviting anyone to help analyze it. OpenML structures and organizes all results online to show the state of the art and push progress.
OpenML is being integrated in most popular machine learning environments, so you can automatically upload all your data, code, and experiments. And if you develop new tools, there's an API for that, plus people to help you.
OpenML allows you to search, compare, visualize, analyze and download all combined results online. Explore the state of the art, improve it, build on it, ask questions and start discussions.
Lowering the bar: deep learning for side-channel analysis – Riscure
Deep learning can help automate the signal analysis process in power side channel analysis. We show how typical signal processing problems such as noise reduction and re-alignment are automatically solved by the deep learning network. We show we can break a lightly protected AES, an AES implementation with masking countermeasures and a protected ECC implementation. These experiments indicate that where previously side channel analysis had a large dependency on the skills of the human, first steps are being developed that bring down the attacker skill required for such attacks using Deep Learning automation.
Learning visual representation without human label – Kai-Wen Zhao
Self-supervised learning (SSL) is one of the fastest-growing research topics in recent years. SSL provides algorithms that learn visual representations directly from the data itself rather than from manual human labels. From a theoretical point of view, SSL explores information theory and the nature of large-scale datasets.
Scalable Data Science and Deep Learning with H2O
In this session, we introduce the H2O data science platform. We will explain its scalable in-memory architecture and design principles and focus on the implementation of distributed deep learning in H2O. Advanced features such as adaptive learning rates, various forms of regularization, automatic data transformations, checkpointing, grid-search, cross-validation and auto-tuning turn multi-layer neural networks of the past into powerful, easy-to-use predictive analytics tools accessible to everyone. We will present a broad range of use cases and live demos that include world-record deep learning models, anomaly detection tools and approaches for Kaggle data science competitions. We also demonstrate the applicability of H2O in enterprise environments for real-world customer production use cases.
By the end of the hands-on-session, attendees will have learned to perform end-to-end data science workflows with H2O using both the easy-to-use web interface and the flexible R interface. We will cover data ingest, basic feature engineering, feature selection, hyperparameter optimization with N-fold cross-validation, multi-model scoring and taking models into production. We will train supervised and unsupervised methods on realistic datasets. With best-of-breed machine learning algorithms such as elastic net, random forest, gradient boosting and deep learning, you will be able to create your own smart applications.
A local installation of RStudio is recommended for this session.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Deep Stream Dynamic Graph Analytics with Grapharis – Massimo Perini – Flink Forward
World's toughest and most interesting analysis tasks lie at the intersection of graph data (inter-dependencies in data) and deep learning (inter-dependencies in the model). Classical graph embedding techniques have for years occupied research groups seeking how complex graphs can be encoded into a low-dimensional latent space. Recently, deep learning has dominated the space of embeddings generation due to its ability to automatically generate embeddings given any static graph.
Grapharis is a project that revitalizes the concept of graph embeddings, yet it does so in a real setting where graphs are not static but keep changing over time (think of user interactions in social networks). More specifically, we explored how a system like Flink can be used both to simplify the process of training a graph embedding model incrementally and to make complex inferences and predictions in real time using graph-structured data streams. To our knowledge, Grapharis is the first complete data pipeline using Flink and TensorFlow for real-time deep graph learning. This talk will cover how we can train, store and generate embeddings continuously and accurately as data evolves over time without the need to re-train the underlying model.
AWS re:Invent 2016: Using MXNet for Recommendation Modeling at Scale (MAC306) – Amazon Web Services
For many companies, recommendation systems solve important machine learning problems. But as recommendation systems grow to millions of users and millions of items, they pose significant challenges when deployed at scale. The user-item matrix can have trillions of entries (or more), most of which are zero. To make common ML techniques practical, sparse data requires special techniques. Learn how to use MXNet to build neural network models for recommendation systems that can scale efficiently to large sparse datasets.
This document provides an overview of object detection techniques including region-based and region-free methods. Region-based methods like R-CNN, Fast R-CNN, and Faster R-CNN first generate region proposals then extract features from those regions to classify and regress bounding boxes. Region-free methods like YOLO, YOLOv2, and SSD predict bounding boxes and classifications directly from the image in one pass. Both approaches are trained end-to-end using techniques like RoI pooling and anchor boxes to predict multiple detections. Recent work aims to improve speed and accuracy by generating detections sequentially or using soft NMS instead of hard thresholding.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
TensorFlow in 3 sentences
Barbara Fusinska provides a high-level overview of TensorFlow in 3 sentences or less. She demonstrates how to build a computational graph for classification tasks using APIs like tf.nn and tf.layers. Barbara encourages attendees to get involved with open source TensorFlow communities on GitHub and through tools like Docker containers.
Distance oracle - Fast distance queries between any two vertices in a graph – Hong Ong
A review of methods for quickly computing the distance between any two vertices in a graph, with applications in many fields such as telecom, internet routing, social network analysis, etc.
The document discusses computing analytics directly in databases to improve performance over traditional approaches that import all data into memory first. It presents three case studies - linear regression, correlation, and value-at-risk estimation - showing how each can leverage database operations like aggregation, sorting, and querying to perform calculations faster by computing near the data. Graphs of execution times demonstrate that partially computing models within the database rather than solely in memory leads to significant speed improvements, especially as data sizes increase.
[241] Large-scale search with polysemous codes – NAVER D2
This document discusses using polysemous codes to perform large-scale search over visual signatures. Polysemous codes allow product quantization codes to be interpreted as both compact binary codes for efficient Hamming distance search and codes that preserve distance information for accurate nearest neighbor search. The key ideas are to learn an index assignment that maps similar product quantization codes to binary codes with smaller Hamming distance, and to directly optimize this assignment to match the distances between codebook centroids. This allows using a single code representation for both fast Hamming search and precise distance search, without increasing memory requirements. The document provides examples of applying polysemous codes to build a large graph connecting images based on visual similarity.
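The dual interpretation described above relies on cheap Hamming-distance filtering over binary codes. As a hedged illustration of just that filtering step (the function names and tiny 8-bit codes are invented for this sketch, not taken from the NAVER deck):

```python
# Toy first-pass filter: keep database codes within a Hamming radius of a query.
# 8-bit codes and the radius value are illustrative only.

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def hamming_filter(query, codes, radius):
    """Return indices of codes within the given Hamming radius of the query."""
    return [i for i, c in enumerate(codes) if hamming(query, c) <= radius]

codes = [0b10110010, 0b10110011, 0b01001100, 0b11110000]
print(hamming_filter(0b10110010, codes, radius=1))  # [0, 1]
```

In a real polysemous-code setup, the surviving candidates would then be re-ranked with the distance-preserving (product quantization) interpretation of the same codes.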
Suggestions:
1) For best quality, download the PDF before viewing.
2) Open at least two windows: One for the Youtube video, one for the screencast (link below), and optionally one for the slides themselves.
3) The Youtube video is shown on the first page of the slide deck; for the slides, just skip to page 2.
Screencast: http://youtu.be/VoL7JKJmr2I
Video recording: http://youtu.be/CJRvb8zxRdE (Thanks to Al Friedrich!)
In this talk, we take Deep Learning to task with real world data puzzles to solve.
Data:
- Higgs binary classification dataset (10M rows, 29 cols)
- MNIST 10-class dataset
- Weather categorical dataset
- eBay text classification dataset (8500 cols, 500k rows, 467 classes)
- ECG heartbeat anomaly detection
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink – Flink Forward
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries and an execution framework.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection and recommendations.
3) The Vertical Hoeffding Tree algorithm in SAMOA provides high parallelism and accuracy for streaming decision tree learning, outperforming native Apache Flink implementations on certain datasets while being faster on others.
The document provides an overview of the Network Simulator ns-2. It discusses:
1) The history and goals of ns-2 including supporting networking research, protocol design and comparison, and providing a collaborative environment.
2) Current projects using ns-2 including SAMAN to build robust networks and CONSER to extend ns-2's capabilities.
3) The components and functionality of ns-2 including modeling wired and wireless networks, various protocols, traffic sources, and queue management.
CIFAR-10 for DAWNBench: Wide ResNets, Mixup Augmentation and "Super Convergen... – Thom Lane
Summary of models and methods used for the DAWNBench CIFAR-10 Challenge. Starting with a review of ResNets from the high-level architecture, we review Basic vs Bottleneck blocks, pre-activation blocks and Wide ResNets. After a brief mention of PyramidNet, ResNeXt and DenseNet models, we look at regularization techniques such as Mixup. And we finish with a review of Cyclical Learning Rates, and the phenomenon of "Super Convergence".
MXNet Gluon API was used for the implementations.
The document summarizes the use of the Sector and Sphere cloud computing software on the Open Cloud Testbed for the SC08 Bandwidth Challenge. Key points include:
- Sector is a distributed storage system and Sphere simplifies distributed data processing using a map-reduce model.
- The Open Cloud Testbed provided 101 nodes across 4 locations for running applications like TeraSort (sorting 1TB of data) and CreditStone (analyzing 3TB of credit card transactions).
- Sector/Sphere applications achieved transfer rates of up to 20Gbps for TeraSort and 7.2Gbps for CreditStone, utilizing the distributed resources for large-scale data processing.
Synapse 2018: Guarding against failure in a hundred step pipeline – Calvin French-Owen
Control store (ctlstore) is a new infrastructure component that solves the "n+1 problem" of independent database failures bringing down a distributed system. It replicates control data from a system of record to local SQLite databases on each service node. Ctlstore loads data transactionally from the system of record using a loader, executive, and control data log. It then replicates changes to local databases using a reflector. Snapshots to object storage allow new instances to start up with the latest data. This approach provides high availability and fast querying of shared control data across many services.
Developing and Deploying Deep Learning Based Computer Vision Systems - Alka N... – CodeOps Technologies LLP
Deep Learning is enabling a wide range of computer vision applications from advanced driver assistance systems to sophisticated medical diagnostic devices. However, designing and deploying these applications involve a lot of challenges like handling large datasets, developing optimized models, effectively performing GPU computing and efficiently deploying deep learning models to embedded boards like NVIDIA Jetson. This session illustrates how MATLAB supports all phases of this workflow starting with algorithm design to automatically generating portable and optimized CUDA code helping engineers and scientists address the commonly observed challenges in deep learning workflow
Lens: Data exploration with Dask and Jupyter widgets – Víctor Zabalza
Lens is an open source Python library for automated data exploration of large datasets using Dask. It computes summary statistics and relationships between columns in a dataset. The results are serialized to JSON for interactive exploration through Jupyter widgets or a web UI. Dask allows the computations to run in parallel across a cluster for scalability. Lens integrates with the SherlockML platform to analyze all datasets uploaded.
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by... – J On The Beach
Developing reliable data acquisition, processing and control modules for mission critical systems - as they run at CERN - has always been challenging. As both data volumes and rates increase, non-functional requirements such as performance, availability, and maintainability are getting more important than ever. C2MON is a modular Open Source Java framework for realising highly available, large industrial monitoring and control solutions. It has been initially developed for CERN’s demanding infrastructure monitoring needs and is based on more than 10 years of experience with the Technical Infrastructure Monitoring (TIM) systems at CERN. Combining maintainability and high-availability within a portable architecture is the focus of this work. Making use of standard Java libraries for in-memory data management, clustering and data persistence, the platform becomes interesting for many Big Data scenarios.
Real-time image processing applied to traffic queue detection algorithm – ajayrampelli
This document describes a real-time image processing algorithm for detecting traffic queues. The algorithm uses two operations: motion detection and vehicle detection. Motion detection involves differencing consecutive image frames and comparing the difference to a threshold. Vehicle detection applies edge detection techniques to image profiles. The algorithm aims to measure queue parameters like length, occurrence period, and slope in real-time using low-cost systems. It processes image sub-profiles to reduce computation time. Experimental results found the algorithm could measure queue length with 95% accuracy.
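The frame-differencing-plus-threshold step described above can be sketched in a few lines. This is a toy illustration (the tiny grayscale grids and the threshold value are invented), not the paper's implementation:

```python
# Motion detection by differencing two frames and thresholding the result.
# Frames are tiny grayscale grids here purely for illustration.

def motion_mask(frame_a, frame_b, threshold):
    """1 where pixel brightness changed by more than the threshold, else 0."""
    return [[1 if abs(a - b) > threshold else 0
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]

prev_frame = [[10, 10, 12],
              [11, 10, 10],
              [10, 12, 11]]
curr_frame = [[10, 90, 12],   # a bright object entered the scene
              [11, 95, 10],
              [10, 12, 11]]

mask = motion_mask(prev_frame, curr_frame, threshold=20)
moving_pixels = sum(map(sum, mask))
print(mask, moving_pixels)  # two pixels flagged as motion
```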
How we use functional programming to find the bad guys @ Build Stuff LT and U... – Richard Minerich
Traditional approaches in anti-money laundering involve simple matching algorithms and a lot of human review. However, in recent years this approach has proven to not scale well with the ever increasingly strict regulatory environment. We at Bayard Rock have had much success at applying fancier approaches, including some machine learning, to this problem. In this talk I walk you through the general problem domain and talk about some of the algorithms we use. I’ll also dip into why and how we leverage typed functional programming for rapid iteration with a small team in order to out-innovate our competitors.
NS2 - the network simulator which has proved useful in studying the dynamic nature of communication networks. Simulation of wired as well as wireless network functions and protocols (e.g. routing algorithms, TCP, UDP) can be done using NS2.
Detecting Hacks: Anomaly Detection on Networking Data – James Sirota
See https://medium.com/@jamessirota for a series of blog entries that goes with this deck...
Defense in Depth for Big Data
Network Anomaly Detection Overview
Volume Anomaly Detection
Feature Anomaly Detection
Model Architecture
Deployment on OpenSOC Platform
Questions
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
For AAA games now there is a consumer expectation that the developer has a post release strategy. This strategy goes beyond just DLC content. Users expect to receive bug fixes, balancing updates, gamemode variations and constant tuning of the game experience. So how can you architect your game technology to facilitate all of this? Stewart explains the unique patching system developed for Crysis 3 Multiplayer which allowed the team to hot-patch pretty much any asset or data used by the game. He also details the supporting telemetry, server and testing infrastructure required to support this along with some interesting lessons learned.
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion... – Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem in itself. In the talk, we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system using tools like Web Services, Spark, Cassandra, MongoDB, AWS. Not only the tools: what should you monitor about the actual data that flows in the system? We’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion... – Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system once you’re using tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Not only the tools: what should you monitor about the actual data that flows in the system?
And we’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Detecting Hacks: Anomaly Detection on Networking Data – DataWorks Summit
This document summarizes techniques for anomaly detection on network data. It discusses defense-in-depth strategies using both misuse detection and anomaly detection. It then describes volume-based and feature-based network anomaly detection, including statistical process control techniques. The document outlines a three-phase anomaly detection process and discusses implementation in Hadoop using time series databases. It provides examples of common network anomalies and techniques for batch and online anomaly detection modeling.
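As a rough illustration of the statistical-process-control idea mentioned above, the sketch below flags points that leave a 3-sigma band around an EWMA baseline. The smoothing constant, band width, and traffic values are invented examples, not taken from the summit deck:

```python
# EWMA-based volume anomaly check: maintain running mean/variance estimates
# and flag points outside mean +/- k * stddev. First sample seeds the baseline.

def ewma_anomalies(series, alpha=0.3, k=3.0):
    """Return a flag per point (after the first) marking 3-sigma outliers."""
    mean = series[0]
    var = 0.0
    flags = []
    for x in series[1:]:
        std = var ** 0.5
        flags.append(abs(x - mean) > k * std if std > 0 else False)
        # update the running EWMA estimates after testing the point
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flags

traffic = [100, 102, 98, 101, 99, 103, 180, 100]
print(ewma_anomalies(traffic))  # only the spike to 180 is flagged
```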
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain... – Big Data Spain
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmadigital.com
Abstract: http://www.bigdataspain.org/program/thu/slot-7.html
Here are some useful GDB commands for debugging:
- break <function> - Set a breakpoint at a function
- break <file:line> - Set a breakpoint at a line in a file
- run - Start program execution
- next/n - Step over to next line, stepping over function calls
- step/s - Step into function calls
- finish - Step out of current function
- print/p <variable> - Print value of a variable
- backtrace/bt - Print the call stack
- info breakpoints/ib - List breakpoints
- delete <breakpoint#> - Delete a breakpoint
- layout src - Switch layout to source code view
- layout asm - Switch layout to assembly view
Many thanks to Nick McKeown (Stanford), Jennifer Rexford (Princeton), Scott Shenker (Berkeley), Nick Feamster (Princeton), Li Erran Li (Columbia), Yashar Ganjali (Toronto)
This document discusses various techniques for optimizing deep neural network models and hardware for efficiency. It covers approaches such as exploiting activation and weight statistics, sparsity, compression, pruning neurons and synapses, decomposing trained filters, and knowledge distillation. The goal is to reduce operations, memory usage, and energy consumption to enable efficient inference on hardware like mobile phones and accelerators. Evaluation methodologies are also presented to guide energy-aware design space exploration.
Data Summer Conf 2018, “How we build Computer vision as a service (ENG)” — Ro... – Provectus
During this presentation, we will look at the different versions of SaaS architectures built on the basis of ML / computer vision: – the advantages and disadvantages of using different design patterns of services – modes of “serving” models (in most cases, TF) – the influence of the architecture and the way it is implemented on product development. Bonus: does the data scientist need to know anything other than data science?
Similar to OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus NagiosGrapher by Daniel Borkmann (20)
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
Hand Rolled Applicative User Validation Code Kata – Philip Schwarz
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather, to provide a small, rough-and-ready exercise to reinforce your muscle memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris – Neo4j
Dr. Jesús Barrasa, Head of Solutions Architecture for EMEA, Neo4j
Discover the latest innovations from Neo4j, including the latest cloud integrations and product improvements that make Neo4j an essential choice for developers building applications with interconnected data and generative AI.
SOCRadar's Aviation Industry Q1 Incident Report is out now!
The aviation industry has always been a prime target for cybercriminals due to its critical infrastructure and high stakes. In the first quarter of 2024, the sector faced an alarming surge in cybersecurity threats, revealing its vulnerabilities and the relentless sophistication of cyber attackers.
SOCRadar’s Aviation Industry, Quarterly Incident Report, provides an in-depth analysis of these threats, detected and examined through our extensive monitoring of hacker forums, Telegram channels, and dark web platforms.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
DDS Security Version 1.2 was adopted in 2024. This revision strengthens support for long-running systems, adding new cryptographic algorithms, certificate revocation, and hardening against DoS attacks.
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions – Peter Muessig
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
Transform Your Communication with Cloud-Based IVR Solutions – TheSMSPoint
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
May Marketo Masterclass, London MUG May 22 2024.pdf – Adele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
OpenMetadata Community Meeting - 5th June 2024 – OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code – Aftab Hussain
Understanding variable roles in code has been found to be helpful by students
in learning programming -- could variable roles help deep neural models in
performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
OSMC 2009 | Anomalieerkennung und Trendvorhersagen an Hand von Daten aus NagiosGrapher by Daniel Borkmann
1. Anomaly detection and trend analysis of critical system states based on Nagios payload
Daniel Borkmann <dborkmann@acm.org>
(Open Source Monitoring Conference 2009)
Leipzig University of Applied Sciences, Faculty of Computer Science,
Mathematics, and Natural Sciences
Max Planck Institute for Human Cognitive and Brain Sciences, IT Department
October 28, 2009
3. Worst-Case: User detects failure within system and reports to
IT-Hotline (frustration on both sides)
Best-Case: Failure is detected and fixed before user notices
anything (frustration on only one side)
4. We face the situation ...
Nagios can only detect failures via static thresholds1
Is perfect for requesting boolean values (e.g. host is up or not)
But not so much in combination with numerical data from
certain processes ...
... process behaviour changes over time, our static
thresholds do not
... what if behaviour is anomalous within thresholds?
Network traffic,
# Mail users,
Temperature sensors, ...
1 OK, WARNING, CRITICAL, UNKNOWN
5. We face the situation ...
Can we predict the future?2
Nagios has no possibility for payload extrapolation (trend)
Possible Use Cases
Software licences (e.g. Matlab),
Network: volume-based restrictions,
HD capacity,
Stock quotes ;)
2 ... rhetorical question
6. 1 Data collection
Sniffing bytes over the net ... netsniff-ng
Integration into Nagios
2 Data analysis
Assumptions about our data ...
Anomaly detection techniques
Holt-Winters-Forecasting (and some improvements)
Cluster analysis approach
Interface to NagiosGrapher
Trend analysis
Levenberg-Marquardt
REA-Framework
3 Links & Q+A
8. Data collection Data analysis Links & Q+A
Data collection
For our analysis we need some numerical data ...
... as an example, we fetch network packets and generate
statistics
Let’s have a short look ...
Daniel Borkmann Anomaly detection and trend analysis
10. tcpdump, libpcap
Let's evaluate what could be used ...
tcpdump is a userspace network sniffer based on libpcap
libpcap: libpcap is a system-independent interface for
user-level packet capture. libpcap provides a portable
framework for low-level network monitoring. Applications
include network statistics collection, security monitoring,
network debugging, etc.3
3 http://sourceforge.net/projects/libpcap/
11. tcpdump, libpcap
libpcap programming isn’t hard, so here we go ... writing a
sniffer for Nagios based on libpcap
Sounds great, doesn’t it?
13. tcpdump, libpcap
For every incoming frame a recvfrom-Syscall is executed
What does that mean?
recvfrom(...)
Buffer will be copied from Userspace to Kernelspace4
Context switch (done by Scheduler / Dispatcher)
Buffer will be copied from Kernelspace back to Userspace5
Context switch (done by Scheduler / Dispatcher)
... very time intensive if done for each frame!
... possible frame drops for socket during high traffic
4 copy_from_user()
5 copy_to_user()
14. netsniff-ng
High performance network sniffer
Consists of
netsniff-ng
check_packets (client for Nagios)
15. netsniff-ng features
The sniffer itself ...
Runs in promiscuous mode
Bypasses the complete network stack
Uses Kernelspace Berkeley Packet Filter (BPF)
Allocates 128 MB or less (probing) Kernelspace Receive Ring
(RX RING)
Ring is Memory-Mapped into Userspace (so no Syscalls like
recvfrom() needed → Zero-Copy)
Branchfree critical path (so we won’t smash the Pipeline)
Tested on Gigabit without packet loss
Can be run as Sysdaemon (silent, creates UDS
Server for communication) or in foreground
16. netsniff-ng features
netsniff-ng -d eth0 -f /etc/netsniff-ng/rules/arp.bpf -C
strace again, looks better now ...
rt_sigprocmask(SIG_BLOCK, [USR1 ALRM], NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [USR1 ALRM], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
write(2, "I: ", 3I: ) = 3
write(2, "60 bytes from 00:1a:4d:45:7c:89 "..., 5360 bytes from 00:
1a:4d:45:7c:89 to ff:ff:ff:ff:ff:ff) = 53
rt_sigprocmask(SIG_BLOCK, [USR1 ALRM], NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [USR1 ALRM], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
write(2, "I: ", 3I: ) = 3
write(2, "60 bytes from 00:10:5a:d8:9a:a4 "..., 5360 bytes from 00:10:
5a:d8:9a:a4 to ff:ff:ff:ff:ff:ff) = 53
17. netsniff-ng features
netsniff-ng -d eth0 -f /etc/netsniff-ng/rules/icmp.bpf
ICMP-Flooding; only 0-3% CPU usage of netsniff-ng during tests
I: elapsed time: 0 d, 0 h, 4 min, 45 s
I: -----------+------------------+------------------+------------------
I: | per sec | per min | total
I: -----------+------------------+------------------+------------------
I: frames | 80201 | 4696273 | 20149411
I: -----------+------------------+------------------+------------------
I: in B | 119622775 | 7003546464 | 30048067864
I: in KB | 116819 | 6839400 | 29343816
I: in MB | 114 | 6679 | 28656
I: in GB | 0 | 6 | 27
I: -----------+------------------+------------------+------------------
[...]
I: 23724545 frames incoming
I: 23724545 frames passed filter
I: 0 frames failed filter (due to out of space)
I: captured frames: 23724545, captured bytes:
35377999772 [34548827 KB, 33739 MB, 32 GB]
18. check_packets features
Is a Unix Domain Socket Client for netsniff-ng
Fetches collected network statistics at runtime via UDS inode
-n option for creating Nagios one-liner → Performance data
Simple Nagios integration with NRPE or check_by_ssh
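The one-liner that -n produces follows the standard Nagios plugin output format: status text, then a `|`, then space-separated performance data pairs. A minimal sketch of composing such a line (the labels and values here are illustrative assumptions, not check_packets' actual output):

```python
# Sketch: building a Nagios plugin one-liner with performance data.
# The "STATUS - text | label=value ..." layout is the standard Nagios
# plugin output format; the labels below are made up for illustration.

def nagios_oneliner(status, text, perfdata):
    """Compose 'STATUS - text | key=value ...' for Nagios to parse."""
    perf = " ".join(f"{k}={v}" for k, v in perfdata.items())
    return f"{status} - {text} | {perf}"

line = nagios_oneliner("OK", "80201 frames per minute",
                       {"frames_per_min": 80201, "bytes_per_min": "119622775B"})
print(line)
# OK - 80201 frames per minute | frames_per_min=80201 bytes_per_min=119622775B
```

Everything after the `|` is what NagiosGrapher's log regex later picks apart into RRD values.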
20. Integration into Nagios
Runs 24x7 on MPI router (DMZ ⇔ Router ⇔ Internet)
6 instances running with following BPFs:
All
HTTP
Broadcast
SSH
Webserver
Mailserver traffic
Data will be passed via check_by_ssh6
(Port 22) through our internal
Firewall to our Nagios server
Visualization realized by NagiosGrapher via
Round Robin Databases
6 NRPE inappropriate in our case
21. Some results
({Sec|}IMAP, {Sec|}SMTP, POP3, HTTP traffic to our Zimbra
mailserver)
22. Some results
(Broadcast traffic)
25. Assumptions
Generally our monitored processes equal stochastic processes
Incoming anomalies themselves are regarded as Poisson processes
A single data point has ...
Baseline (”intercept“) or irregular component
Linear trend (”slope“) component
Seasonal trend (at least one)
We assume components are additive
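The additive assumption can be made concrete with a tiny synthetic series: each point is just the sum of a baseline, a linear trend term, and a seasonal component (all values below are made up for illustration):

```python
# Sketch of the additive model: y(t) = baseline + slope*t + season(t).
# All numbers are illustrative, not measured data.
m = 4                              # season length
baseline = 100.0                   # "intercept"
slope = 0.5                        # linear trend per step
season = [10.0, -5.0, 0.0, -5.0]   # one full season

def additive_point(t):
    return baseline + slope * t + season[t % m]

series = [additive_point(t) for t in range(12)]
print(series[:4])   # [110.0, 95.5, 101.0, 96.5]
```

Holt-Winters (next slides) decomposes a real series back into exactly these three components.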
28. Holt-Winters-Forecasting
Based on exponential smoothing (recent data have higher
weights than older ones): ŷ(t+1) = α·y(t) + (1 − α)·ŷ(t) with
0 < α < 1
Decomposition of time series into components with triple
exponential smoothing technique
Multiplicative and additive versions
Forecast of the near future (with +/− deviation), check if
actual value falls into expected interval (if not: anomaly after
breaking thresholds)
29. Holt-Winters-Forecasting
What the math model looks like:
Prediction:
ŷ(t+1) = a(t) + b(t) + c(t+1−m)
a(t) = α·(y(t) − c(t−m)) + (1 − α)·(a(t−1) + b(t−1))  Baseline,
b(t) = β·(a(t) − a(t−1)) + (1 − β)·b(t−1)  Linear trend,
c(t) = γ·(y(t) − a(t)) + (1 − γ)·c(t−m)  Seasonal trend
0 < α, β, γ < 1
m: length of one season
Confidence band coefficient:
d(t) = γ·|y(t) − ŷ(t)| + (1 − γ)·d(t−m)
Confidence band itself:
I = (ŷ(t) − δ⁻·d(t−m), ŷ(t) + δ⁺·d(t−m)) with 2 < δ⁻, δ⁺ < 3
Sliding window as threshold for anomalies
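The recursions above fit in a few lines of code. This is a minimal sketch of additive Holt-Winters with the confidence-band check, not the talk's actual implementation: the naive initialization over the first season and the parameter values are simplifying assumptions.

```python
# Sketch of additive Holt-Winters (triple exponential smoothing) with the
# confidence band from the slide. Initialization over the first season and
# the default parameters are simplified assumptions.

def holt_winters(y, m, alpha=0.5, beta=0.1, gamma=0.3, dminus=2.5, dplus=2.5):
    """Return (forecasts, anomalies): one-step forecasts y_hat(t) for t >= m
    and indices where y(t) falls outside the confidence band."""
    a = sum(y[:m]) / m                     # baseline ("intercept")
    b = 0.0                                # linear trend ("slope")
    c = [y[i] - a for i in range(m)]       # seasonal components
    d = [abs(y[i] - a) for i in range(m)]  # band coefficients
    forecasts, anomalies = [], []
    for t in range(m, len(y)):
        y_hat = a + b + c[t % m]           # prediction a(t-1)+b(t-1)+c(t-m)
        forecasts.append(y_hat)
        lo = y_hat - dminus * d[t % m]
        hi = y_hat + dplus * d[t % m]
        if not (lo <= y[t] <= hi):
            anomalies.append(t)
        # update components exactly as in the recursions above
        a_prev = a
        a = alpha * (y[t] - c[t % m]) + (1 - alpha) * (a + b)
        b = beta * (a - a_prev) + (1 - beta) * b
        c[t % m] = gamma * (y[t] - a) + (1 - gamma) * c[t % m]
        d[t % m] = gamma * abs(y[t] - y_hat) + (1 - gamma) * d[t % m]
    return forecasts, anomalies

# perfectly seasonal series with one injected spike at index 13
data = [10, 20, 30, 20] * 6
data[13] = 90
fc, an = holt_winters(data, m=4)
print(an)   # the injected spike at index 13 is the first flagged anomaly
```

Note that the spike also disturbs the smoothed parameters, so a few points after it may be flagged too, which is exactly the robustness problem the Kalman-filter stabilization on the next slide addresses.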
30. Holt-Winters-Forecasting
Possible improvements:
Double seasonality (periods of week, day)
Up to now only single seasonality (day) → parameter
disturbances when passing from business days into
non-business days and vice versa
Parameter stabilization via Kalman filter [Gelper et al. 2007]
Will make parameters more robust towards interferences of
anomalies
Miller extension for improving prediction of low traffic
values [Miller 2007]
Long-term zero values decrease the confidence band to
minimal width → slight payload variance will be flagged
as an anomaly even though it isn't one
31. Holt-Winters-Forecasting, Results (all traffic)
32. Holt-Winters-Forecasting, Results (active Zimbra users)
34. Cluster analysis approach
Idea: one-dimensional grouping of data points to interval
bands in order to separate anomalous from normal behaviour
Segregation of business days and non-business days
Grouping threshold set by standard deviation of Poisson
process (adapted with coefficient)
Groups consist of at least 25 per cent of data points (as per
definition)
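The grouping idea above can be sketched as follows: sort the data points, split wherever the gap between neighbours exceeds a threshold derived from the Poisson standard deviation (square root of the mean, scaled by a coefficient), and keep only groups holding at least 25% of the points as "normal" interval bands. The coefficient value is an illustrative assumption.

```python
import math

# Sketch of one-dimensional grouping into interval bands.
# Coefficient k is an assumed value, not from the talk.

def interval_bands(values, k=1.0, min_share=0.25):
    xs = sorted(values)
    mean = sum(xs) / len(xs)
    threshold = k * math.sqrt(mean)   # Poisson process: std = sqrt(mean)
    groups, current = [], [xs[0]]
    for prev, x in zip(xs, xs[1:]):
        if x - prev <= threshold:
            current.append(x)         # close enough: same group
        else:
            groups.append(current)    # gap too large: start a new group
            current = [x]
    groups.append(current)
    # keep only groups with >= 25% of all points; return (lo, hi) bands
    return [(g[0], g[-1]) for g in groups if len(g) >= min_share * len(xs)]

# mostly "normal" traffic around 100 frames/min plus two outliers
data = [98, 101, 99, 103, 100, 97, 102, 250, 5]
print(interval_bands(data))   # [(97, 103)]
```

Values outside every returned band (here 250 and 5) are then the anomaly candidates.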
35. Cluster analysis approach
36. Cluster analysis, Results (SSH traffic)
37. Cluster analysis, Results (Broadcast traffic)
39. Interface to NagiosGrapher
Three thread queues filled during NG's collect2.pl runtime
(patched against nagiosgrapher (1.6.1rc5-6), Debian Lenny
(stable))
Detached threads check for anomalies according to given
algorithm (start baseline.pl delegate after RRD update)
Module Baseline::Algorithm::CalcRRD with calc_rrd() and
check_baseline() implemented for specific algorithm
40. Interface to NagiosGrapher
Algorithms swappable during runtime without payload loss!
Anomaly information on Nagios ‘Service Detail‘ page
Baseline::Util::Routines contains helper routines for CalcRRD
implementation (such as fetch_stepping(), fetch_lastupdate(),
fetch_table(), ...)
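The interface itself is Perl (Baseline::Algorithm::CalcRRD with calc_rrd() and check_baseline()); the following is only a Python analogy of its two-hook, swap-at-runtime shape. All class and method bodies here are hypothetical illustrations, not the NagiosGrapher code.

```python
# Python analogy of the pluggable-algorithm contract: one hook computes
# baseline values, one classifies a value against them. Names and the
# example algorithm are hypothetical, not from NagiosGrapher.

class BaselineAlgorithm:
    """Contract every pluggable algorithm has to fulfil."""
    def calc_rrd(self, samples):
        raise NotImplementedError   # compute a baseline from RRD samples
    def check_baseline(self, value, baseline):
        raise NotImplementedError   # return True if value is anomalous

class MeanBand(BaselineAlgorithm):
    """Trivial example algorithm: mean plus a fixed tolerance."""
    def __init__(self, tolerance):
        self.tolerance = tolerance
    def calc_rrd(self, samples):
        return sum(samples) / len(samples)
    def check_baseline(self, value, baseline):
        return abs(value - baseline) > self.tolerance

# the collector only ever sees the contract, so implementations
# (Holt-Winters, cluster analysis, ...) can be swapped at runtime
algorithm = MeanBand(tolerance=15.0)
base = algorithm.calc_rrd([100, 105, 95, 110, 90])
print(algorithm.check_baseline(160, base))   # True: far outside tolerance
```

Because payload collection and anomaly checking only meet through this contract, swapping algorithms loses no payload, as the slide states.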
41. Interface to NagiosGrapher
42. NagiosGrapher GUI, config integration
define ngraph{
service_name Netsniff *
graph_log_regex .*per\sminute:\s+([0-9.]+)\sframes
graph_value pktfr
graph_units frames/min
graph_legend frames per min
rrd_plottype LINE2
rrd_color FF0000
hide no
page data
}
define ngraph{
service_name Netsniff *
type BASELINE <-- unknown, will be ignored within perfdata parsing
graph_value pktfrbase
graph_units health
graph_legend health temperature
rrd_plottype LINE2
rrd_color 00ff00
hide no
page health_frames
}
43. Interface to NagiosGrapher
Two (or more) pages per Service (‘data‘, ‘health_*‘)
Actual data (payload, anomaly) still separated
44. Interface to NagiosGrapher
Integration into Nagios ‘Service Detail‘ page
check_baseline() part of Baseline::Algorithm::CalcRRD called
47. Levenberg-Marquardt
Non-linear fitting algorithm
Fits a math model (e.g. f(x) := a·cos(bx) + b·sin(ax) + c) into
a series of data points with minimal residuals (parameter
search → {a, b, c})
More ‘stable‘ in finding a local minimum (even with a badly
chosen start vector) than the Gauss-Newton method (with
‘lambda decay‘)
Non-linearity approximated by iterative solving of linear
equation systems (Taylor series)
Note: quality of the fit depends on your
defined model!
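A minimal pure-Python sketch of the Levenberg-Marquardt iteration: each step solves the damped normal equations (JᵀJ + λ·diag(JᵀJ))·δ = −Jᵀr with a numerical Jacobian, shrinking λ on accepted steps ("lambda decay") and growing it on rejected ones. The model f(x) = a·(x − b)² + c and all data here are illustrative assumptions, not the talk's example.

```python
# Levenberg-Marquardt sketch: damped Gauss-Newton on residuals
# r_i = y_i - f(x_i, p), using a forward-difference Jacobian and a
# small Gaussian-elimination solver. Model and data are made up.

def model(x, p):
    a, b, c = p
    return a * (x - b) ** 2 + c

def solve(A, v):
    """Gaussian elimination with partial pivoting for a small system."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    out = [0.0] * n
    for r in range(n - 1, -1, -1):
        out[r] = (M[r][n] - sum(M[r][k] * out[k] for k in range(r + 1, n))) / M[r][r]
    return out

def levmar(xs, ys, p, lam=1e-3, iters=100, eps=1e-6):
    n = len(p)
    def sse(q):
        return sum((y - model(x, q)) ** 2 for x, y in zip(xs, ys))
    err = sse(p)
    for _ in range(iters):
        r = [y - model(x, p) for x, y in zip(xs, ys)]
        # forward-difference Jacobian of the residuals w.r.t. parameters
        J = []
        for x in xs:
            row = []
            for j in range(n):
                q = p[:]; q[j] += eps
                row.append(-(model(x, q) - model(x, p)) / eps)
            J.append(row)
        JtJ = [[sum(J[i][a] * J[i][b] for i in range(len(xs)))
                for b in range(n)] for a in range(n)]
        Jtr = [-sum(J[i][a] * r[i] for i in range(len(xs))) for a in range(n)]
        A = [[JtJ[a][b] + (lam * JtJ[a][a] if a == b else 0.0)
              for b in range(n)] for a in range(n)]
        delta = solve(A, Jtr)
        trial = [p[j] + delta[j] for j in range(n)]
        if sse(trial) < err:
            p, err, lam = trial, sse(trial), lam / 10   # accept, decay lambda
        else:
            lam *= 10                                   # reject, damp harder
    return p

xs = [x * 0.5 for x in range(20)]
ys = [2.0 * (x - 1.0) ** 2 + 3.0 for x in xs]   # exact data: a=2, b=1, c=3
a, b, c = levmar(xs, ys, [1.0, 0.0, 0.0])
print([round(v, 3) for v in (a, b, c)])
```

The λ adaptation is what makes the method tolerate a badly chosen start vector: large λ degrades the step toward steepest descent, small λ toward Gauss-Newton.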
49. REA-Framework
NG-independent script collection (works with all possible RRDs)
generate.pl
Called via nightly cronjob
Automatically extracts RRD payload (extract_raw.pl) →
triple: (timestamp, normalized idx, data value)
Runs configured plugins via gnuplot processor
Generates graph images (if possible) and copies them into
WWW-dir
predict.cgi
User interface
User assigns configured ‘subject‘ (e.g. host service) to ‘analysis
type‘ (e.g. monomial extrapolation)
All RRA specific graphs with chosen fits
are shown
50. REA-Framework
51. REA-Framework, plugin for Monomial extrapolation
#!/usr/bin/gnuplot
FIT_LIMIT = 1e-6
FIT_MAXITER = 80
f(x) = a * (x - b)**n + c
fit f(x) "exp.dat" using 2:3 via a, b, c, n
set grid
set title "RRD time series extrapolation"
set timefmt "%s"
set xdata time
set format x "%d.%m.%Y, %H/%M"
set xlabel " "
set ylabel " "
set xtics rotate by 90 scale 0
set key below
plot "< cat exp.dat fore.dat" using 1:(f($2)) title "best ’a * (x - b)^n + c’ fit" with lines,
"exp.dat" using 1:3 title "time series data" with points
set term png size 1024, 768
set output "plot.png"
replot
set term dumb
52. REA-Framework, frontend
53. REA-Framework, Results (Traffic to webserver, KB/min)
54. REA-Framework, Results (Broadcast traffic, Frames/min)
55. REA-Framework, Results (SSH traffic, Frames/min)
56. Acknowledgement
Dr. Helmut Hayd, Max Planck Institute for Human Cognitive
and Brain Sciences
Prof. Dr.-Ing. Dietmar Reimann, Leipzig University of Applied
Sciences, Faculty of Computer Science, Mathematics, and
Natural Sciences
57. Links
Thesis: http://edoc.mpg.de/437809
netsniff-ng: http://netsniff-ng.googlecode.com
All the rest8: <dborkmann@acm.org>
8 It's still a prototype ...
58. Q+A?