In today's world, the majority of information is generated by self-sustaining systems such as bots, crawlers, servers, and various online services. This information flows along the axis of time and is generated by these actors under some complex logic. Examples include a stream of buy/sell order requests from an order gateway in the financial world, a stream of web requests from a monitoring or crawling service, or a hacker's bot sitting on the internet and attacking various computers. Although we may not be able to know the motive or intention behind these data sources, unsupervised techniques let us infer patterns and correlate events based on their repeated occurrences over time. Associating a chain of events in time order helps in root-event analysis. In certain cases, time-ordered correlation and root-event identification are good enough to automatically identify the signatures of various malicious actors and take appropriate corrective action, such as stopping cyber attacks or malicious social campaigns.
Sessionisation is one such unsupervised technique: it tries to find the signal in a stream of timestamped events. In an ideal world this would reduce to finding the periods of a mixture of sinusoidal waves. In the real world it is a much more complex activity, as even the systematic events generated by machines on the internet behave erratically, so the very notion of a signal's period changes. We can no longer associate it with a single number; it has to be treated as a random variable, with an expected value and an associated variance. Hence we need to model "stochastic periods" and learn their probability distributions in an unsupervised manner.
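To make the idea of a stochastic period concrete, here is a minimal numpy sketch (an illustration of this summary, not code from the talk): instead of reporting a single period, it summarises the inter-arrival gaps of a jittered event stream as a random variable with a mean and a standard deviation.

```python
import numpy as np

def stochastic_period(timestamps):
    """Summarise the 'period' of an event stream as a random variable:
    the mean and standard deviation of its inter-arrival gaps."""
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    return gaps.mean(), gaps.std()

# A jittered periodic source: nominally one event every 10 s.
rng = np.random.default_rng(0)
events = np.cumsum(rng.normal(loc=10.0, scale=1.0, size=500))

mean, std = stochastic_period(events)
print(f"period ~ {mean:.1f} s +/- {std:.1f} s")
```

A strictly periodic source would give a standard deviation near zero; here the spread itself is part of the signal, which is exactly what the "stochastic period" framing captures.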
The main focus of this talk will be to showcase applied data science techniques for discovering stochastic periods. There are many ways to obtain periods from data, so the journey begins with a walkthrough of existing techniques such as the FFT (Fast Fourier Transform), followed by a discussion of Gaussian Mixture Models. After highlighting the shortcomings of these techniques, we will succinctly explain one of the most general non-parametric Bayesian approaches to this problem. Without going too deep into the complex math, we will then return to applied data science and discuss a much simpler technique that can solve the same problem when certain assumptions hold.
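The FFT baseline can be sketched in a few lines of numpy (our illustration on synthetic data with an assumed 24-step cycle, not the talk's code): given a uniformly sampled event-count series, the dominant period is the reciprocal of the frequency with the largest spectral amplitude.

```python
import numpy as np

# Synthetic count series with an assumed 24-step cycle plus noise.
rng = np.random.default_rng(1)
n = 24 * 50
t = np.arange(n)
counts = 10 + 5 * np.sin(2 * np.pi * t / 24) + rng.normal(0.0, 1.0, n)

# Peak of the amplitude spectrum, skipping the DC component.
spectrum = np.abs(np.fft.rfft(counts - counts.mean()))
freqs = np.fft.rfftfreq(n, d=1.0)
peak = spectrum[1:].argmax() + 1
period = 1.0 / freqs[peak]
print(f"dominant period: {period:.1f} steps")
```

This works only when the period really is a single number; once the gaps themselves are noisy and multi-modal, the spectral peaks smear out, which is the shortcoming that motivates the mixture-model and Bayesian non-parametric approaches.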
In this talk we will demonstrate some time-based patterns we discovered while working on a security analytics use case that uses sessionisation. The patterns shown are based on an open-source malware attack dataset that is publicly available.
Key concepts explained in the talk: sessionisation, Bayesian machine learning techniques, Gaussian Mixture Models, kernel density estimation, FFT, stochastic periods, probabilistic modelling, and Bayesian non-parametric methods.
[EUC2016] FFWD: latency-aware event stream processing via domain-specific load shedding - Matteo Ferroni
Tools and applications for event stream processing and real-time analytics are generating huge hype these days across a wide range of application scenarios, from the smallest Internet of Things (IoT) embedded sensor to the most popular social network feed. Unfortunately, dealing with this kind of input raises issues that can easily undermine the real-time analysis requirement due to an unexpected overload of the system; this happens because the processing time may strongly depend on the content of a single event, while the event arrival rate may vary unpredictably over time. In this work, we propose Fast Forward With Degradation (FFWD), a latency-aware load shedding framework that exploits performance degradation techniques to adapt the throughput of the application to the size of the input, allowing the system to respond quickly and reliably in case of overload. Moreover, we show how different domain-specific policies can guarantee a reasonable accuracy of the aggregated output metrics.
Full paper: http://ieeexplore.ieee.org/document/7982234/
This document discusses speaker diarization, which is the process of segmenting an audio stream into homogeneous segments according to speaker identity. It covers feature extraction methods like MFCCs, segmentation using Bayesian Information Criteria to compare Gaussian mixture models, and clustering algorithms like k-means and hierarchical agglomerative clustering. Dendrogram visualizations are used to identify natural speaker clusters. The overall goal is to partition audio recordings of discussions or debates into homogeneous segments to attribute speech segments to individual speakers.
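The BIC-based split test mentioned above can be sketched with a single 1-D Gaussian per segment (a simplification we assume for illustration; real diarization systems use multivariate MFCC features and full GMMs):

```python
import numpy as np

def delta_bic(x, t, lam=1.0):
    """Delta-BIC for splitting a 1-D feature sequence x at index t.
    Positive values favour a speaker change at t; negative values
    favour modelling the whole segment with one Gaussian."""
    def nll(seg):  # Gaussian negative log-likelihood, up to constants
        return 0.5 * len(seg) * np.log(seg.var() + 1e-12)
    penalty = 0.5 * lam * 2 * np.log(len(x))  # 2 extra params: mean, var
    return nll(x) - (nll(x[:t]) + nll(x[t:])) - penalty

# Two synthetic "speakers" with different feature means.
rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(4.0, 1.0, 200)
x = np.concatenate([a, b])
print(delta_bic(x, 200) > 0)
```

In a full system this test is run over candidate boundaries in a sliding window, and the resulting segments are then merged by hierarchical agglomerative clustering as the summary describes.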
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME - Hongjoo Lee
A 45-minute talk about collecting home network performance measures, analyzing and forecasting the time-series data, and building an anomaly detection system.
In this talk, we will go through the whole process of data mining and knowledge discovery. First, we write a script that runs a speed test periodically and logs the metric. Then we parse the log data, convert it into a time series, and visualize the data over a given period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep learning techniques are used for the analysis, including ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
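ARIMA and LSTM are the tools named in the abstract; as a much simpler stand-in, the sketch below (our illustration, not the speaker's code) fits an AR(1) model with plain numpy and flags points whose one-step-ahead residual is an outlier:

```python
import numpy as np

def ar1_anomalies(y, z_thresh=3.0):
    """Fit y[t] ~ a*y[t-1] + b by least squares and flag indices
    whose one-step-ahead residual exceeds z_thresh sigmas."""
    X = np.column_stack([y[:-1], np.ones(len(y) - 1)])
    (a, b), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    resid = y[1:] - (a * y[:-1] + b)
    z = (resid - resid.mean()) / resid.std()
    return np.where(np.abs(z) > z_thresh)[0] + 1  # indices into y

# Simulated speed-test log: ~100 Mbps with one outage-like drop.
rng = np.random.default_rng(3)
speed = 100.0 + rng.normal(0.0, 2.0, 300)
speed[150] = 20.0
print(ar1_anomalies(speed))
```

A full ARIMA(p, d, q) model adds differencing and a moving-average term on top of this autoregressive core, and an LSTM replaces the linear predictor with a learned nonlinear one; the residual-thresholding step stays essentially the same.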
This document describes research on implementing Curran's approximation algorithm for pricing Asian options using a dataflow architecture. The algorithm was implemented on a Maxeler dataflow engine (DFE) and compared to a CPU implementation. Different fixed-point precisions were tested on the DFE and 54-bit fixed-point provided the best balance of precision and resource usage. Implementing the algorithm across multiple DFEs provided speedups of 5-12x over a 48-core CPU. Further optimization of dynamic ranges allowed increasing the unrolling factor, improving performance and energy efficiency.
Self-adaptive container monitoring with performance-aware Load-Shedding policies, by Rolando Brondolin, PhD student in System Architecture at Politecnico di Milano
OSMC 2009 | Anomaly detection and trend forecasting based on Nagios data - NETWAYS
Using statically defined thresholds as bounds on behavior, Nagios offers only limited means of examining behavior that changes over time for anomalies. Only the so-called Holt-Winters forecasting algorithm, integrated into Nagios by Jake D. Brutlag, is available, and it reaches its limits in some cases. Neither Nagios nor NagiosGrapher provides many further options for this kind of behavioral analysis, which gave rise to the idea of integrating a general interface (as a prototype) into NagiosGrapher, so that different algorithms for extrapolation and anomaly analysis can be tried out and easily swapped.
The talk covers the design and operation of this interface and explains several methods for behavioral analysis. RRD data are used to compute baselines, deviations from them, and their extrapolation. More information: gnumaniacs.org
Logical clocks assign sequence numbers to distributed system events to determine causality without a global clock. Lamport's algorithm uses logical clocks to impose a partial ordering on events. Vector clocks extend this to also detect concurrent events that are not causally related, providing a full happened-before relation between all events. Each process maintains a vector clock that is incremented after local events and updated when receiving messages from other processes.
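The mechanics in this summary fit in a few lines of Python (a minimal sketch for a fixed set of processes, assuming reliable message delivery):

```python
class VectorClock:
    """Minimal vector clock for one of n processes."""

    def __init__(self, n, pid):
        self.clock = [0] * n
        self.pid = pid

    def local_event(self):
        self.clock[self.pid] += 1

    def send(self):
        # a send is an event: tick, then ship a copy of the clock
        self.local_event()
        return list(self.clock)

    def receive(self, msg_clock):
        # element-wise max with the message's clock, then tick
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.pid] += 1

def happened_before(u, v):
    """u causally precedes v iff u <= v element-wise and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
msg = p0.send()
stamp_send = list(p0.clock)   # [1, 0]
p1.receive(msg)
stamp_recv = list(p1.clock)   # [1, 1]
print(happened_before(stamp_send, stamp_recv))   # True: send -> receive
```

Two stamps where `happened_before` is false in both directions are concurrent, which is exactly the case a single Lamport counter cannot detect.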
This paper proposes a multiple query optimization (MQO) scheme for change point detection (CPD) that can significantly reduce the number of operators needed. CPD is used to detect anomalies in time series data but requires tuning parameters, which leads to running multiple CPDs with different parameters. The paper identifies four patterns for sharing CPD operators between queries based on whether parameter values are the same. Experiments show the proposed MQO approach reduces the number of operators by up to 80% compared to running each CPD independently, thus improving performance. Integrating MQO with hardware accelerators is suggested as future work.
The Case for a Signal-Oriented Data Stream Management System - Reza Rahimi
This document proposes a signal-oriented data stream management system called WaveScope. It discusses typical applications involving sensor networks, the data and programming model using a domain-specific language called WaveScript, and the system architecture involving query planning, optimization, and distributed execution. Key aspects include managing timing information across different timebases, optimizing queries using both database and signal processing techniques, and supporting archived historical data retrieval.
This document discusses scheduling in distributed systems. It covers:
1) Common scheduling techniques like min-min, max-min, and sufferage for scheduling independent tasks on dedicated systems.
2) Scheduling dependent tasks modeled as directed acyclic graphs (DAGs) using techniques like critical path on a processor (CPOP) and heterogeneous earliest finish time (HEFT).
3) The need for scheduling algorithms to adapt to dynamic grid environments where tasks may have dependencies on shared files and network transfer times vary.
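Of the heuristics listed, min-min is the simplest to sketch (a toy version over a static expected-time-to-compute matrix, which we assume for illustration):

```python
def min_min(etc):
    """Min-min scheduling: etc[t][m] is the estimated time to compute
    task t on machine m. Repeatedly schedules the task whose earliest
    completion time is smallest. Returns (assignment, machine ready times)."""
    n_machines = len(etc[0])
    ready = [0.0] * n_machines
    unscheduled = set(range(len(etc)))
    assignment = {}
    while unscheduled:
        # best machine (earliest completion) for each remaining task
        best = {t: min(range(n_machines), key=lambda m: ready[m] + etc[t][m])
                for t in unscheduled}
        # the task with the minimum of those earliest completion times
        t = min(unscheduled, key=lambda t: ready[best[t]] + etc[t][best[t]])
        m = best[t]
        ready[m] += etc[t][m]
        assignment[t] = m
        unscheduled.remove(t)
    return assignment, ready

etc = [[3, 5],   # task 0
       [1, 2],   # task 1
       [4, 1]]   # task 2
assignment, ready = min_min(etc)
print(assignment, max(ready))   # makespan 4.0
```

Max-min differs only in picking the task with the *largest* earliest completion time each round, which tends to avoid leaving one long task for the end; sufferage instead prioritises the task that would suffer most from losing its best machine.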
This document describes fast single-pass k-means clustering algorithms. It discusses the rationale for using k-means clustering to enable fast search over large datasets. The document outlines ball k-means and surrogate clustering algorithms that can cluster data in a single pass. It discusses how these algorithms work and their implementation, including using locality sensitive hashing and projection searches to speed up clustering over high-dimensional data. Evaluation results show these algorithms can accurately cluster data much faster than traditional k-means approaches. The applications of these fast clustering algorithms include enabling fast nearest neighbor searches over large customer datasets for applications like marketing and fraud prevention.
This document provides an overview of deep deterministic policy gradient (DDPG), which combines aspects of DQN and policy gradient methods to enable deep reinforcement learning with continuous action spaces. It summarizes DQN and its limitations for continuous domains. It then explains policy gradient methods like REINFORCE, actor-critic, and deterministic policy gradient (DPG) that can handle continuous action spaces. DDPG adopts key elements of DQN like experience replay and target networks, and models the policy as a deterministic function like DPG, to apply deep reinforcement learning to complex continuous control tasks.
This document discusses logical clocks and vector clocks for capturing causality in distributed systems. It defines causal order and shows how Lamport logical clocks assign timestamps to events to preserve causal order, with timestamps providing a partial order. Vector clocks also preserve causal order and provide a partial order, but additionally guarantee that if one event causally precedes another, it will have a smaller vector timestamp.
Logical clocks are mechanisms for capturing chronological and causal relationships in distributed systems. Lamport introduced logical clocks using timestamps assigned to events to define a "happens before" relation between events. Vector clocks extend logical timestamps to capture causality more accurately by maintaining a vector of timestamps, with one entry per process. Matrix clocks further extend this idea by maintaining a matrix to represent processes' knowledge of other processes' logical clocks.
Anima Anandkumar, Principal Scientist, Amazon Web Services, Endowed Professor... - MLconf
Large-scale Machine Learning: Deep, Distributed and Multi-Dimensional:
Modern machine learning involves deep neural network architectures which yield state-of-the-art performance on multiple domains such as computer vision, natural language processing and speech recognition. As the data and models scale, it becomes necessary to have multiple processing units for both training and inference. Apache MXNet is an open-source framework developed for distributed deep learning. I will describe the underlying lightweight hierarchical parameter server architecture that results in high efficiency in distributed settings.
Pushing the current boundaries of deep learning requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. We present new deep learning architectures that preserve the multi-dimensional information in data end-to-end. We show that tensor contractions and regression layers are an effective replacement for fully connected layers in deep learning architectures. They result in significant space savings with negligible performance degradation. These functionalities are available in the Tensorly package with MXNet backend interface for large-scale efficient learning.
Bio: Anima Anandkumar is a principal scientist at Amazon Web Services and a Bren professor at Caltech CMS department. Her research interests are in the areas of large-scale machine learning, non-convex optimization and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. She is the recipient of several awards such as the Alfred. P. Sloan Fellowship, Microsoft Faculty Fellowship, Google research award, ARO and AFOSR Young Investigator Awards, NSF Career Award, Early Career Excellence in Research Award at UCI, Best Thesis Award from the ACM Sigmetrics society, IBM Fran Allen PhD fellowship, and several best paper awards. She has been featured in a number of forums such as the yourstory, Quora ML session, O’Reilly media, and so on. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a postdoctoral researcher at MIT from 2009 to 2010, an assistant professor at U.C. Irvine between 2010 and 2016, and a visiting researcher at Microsoft Research New England in 2012 and 2014.
Leveraging Bagging for Evolving Data Streams - Albert Bifet
The document presents new methods for leveraging bagging for evolving data streams. It discusses using randomization techniques like Poisson distributions for input data and random output codes to increase diversity among classifiers. Experimental results on data streams with concept drift show the proposed methods like Leveraging Bagging and Leveraging Bagging MC improve accuracy over baselines like Hoeffding Trees and Online Bagging, while methods like Leveraging Bagging ME reduce RAM-Hours usage. The paper aims to improve accuracy and resource usage for data stream mining under concept drift.
Classifying Multi-Variate Time Series at Scale:
Characterizing and understanding the runtime behavior of large-scale Big Data production systems is extremely important. Typical systems consist of hundreds to thousands of machines in a cluster with hundreds of terabytes of storage costing millions of dollars, solving problems that are business critical. By instrumenting each running process and measuring its resource utilization, including CPU, memory, I/O, network, etc., as time series, it is possible to understand and characterize the workload on these massive clusters. Each time series consists of tens to tens of thousands of data points that must be ingested and then classified. At Pepperdata, our instrumentation of the clusters collects over three hundred metrics from each task every five seconds, resulting in millions of data points per hour. At this scale the data are equivalent to the biggest IoT data sets in the world. Our objective is to classify the collection of time series into a set of classes that represent different workload types. Phrased differently, our problem is essentially that of classifying multivariate time series.
In this talk, we propose a unique, off-the-shelf approach to classifying time series that achieves near best-in-class accuracy for univariate series and generalizes to multivariate time series. Our technique maps each time series to a Gramian Angular Difference Field (GADF), interprets that as an image, uses Google's pre-trained Inception v3 CNN to map the GADF images into a 2048-dimensional vector space, and then uses a small MLP with two hidden layers of fifty nodes each and a softmax output to achieve the final classification. Our work is not domain specific, a fact proven by our achieving competitive accuracies with published results on the univariate UCR data set as well as the multivariate UCI data set.
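For readers unfamiliar with the transform, a Gramian Angular Difference Field can be sketched in a few lines (our illustration of the standard construction, not Pepperdata's pipeline): rescale the series to [-1, 1], map each value to an angle, and take pairwise sine differences.

```python
import numpy as np

def gadf(series):
    """Gramian Angular Difference Field of a 1-D series:
    GADF[i, j] = sin(phi_i - phi_j), where phi = arccos of the
    series rescaled to [-1, 1]."""
    x = np.asarray(series, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    phi = np.arccos(x)
    return np.sin(phi[:, None] - phi[None, :])

img = gadf([0.0, 0.5, 1.0, 0.5, 0.0])
print(img.shape)   # (5, 5): one "pixel" per pair of time steps
```

The resulting matrix encodes the temporal correlation between every pair of time steps as image texture, which is what lets an off-the-shelf image CNN be reused as a time-series feature extractor.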
Bio: Before joining Pepperdata, Ash was executive chairman for Marianas Labs, a deep learning startup sold in December 2015. Prior to that he was CEO for Graphite Systems, a big data storage startup that was sold to EMC DSSD in August 2015. Munshi also served as CTO of Yahoo, as a CEO of both public and private companies, and is on the board of several technology startups.
Improving Numerical Wave Forecasts by Data Assimilation Based on Neural Networks - Aditya N Deshmukh
This document presents a study that uses artificial neural networks (ANNs) to improve numerical wave forecasts of significant wave height (Hs) and peak wave period (Tp) through data assimilation. The study uses ANNs to predict errors between numerical model outputs and buoy observations, then corrects the numerical forecasts by adding/subtracting the predicted errors. Results show the ANN-corrected numerical forecasts have improved agreement with observations compared to the original numerical forecasts or standalone ANN models, especially for 6-hour and 24-hour ahead predictions of Hs across different seasons.
Dueling network architectures for deep reinforcement learning - Taehoon Kim
1. The document proposes a dueling network architecture for deep reinforcement learning that separately estimates state value and state-dependent action advantages without extra supervision.
2. It introduces a dueling deep Q-network that uses a single network with two streams - one that produces a state value and the other that produces state-dependent action advantages, which are then combined to estimate the state-action value function.
3. Experiments on Atari games show that the dueling network outperforms traditional deep Q-networks, achieving better performance in both random starts and starts from human demonstrations.
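The combination of the two streams described in point 2 is a one-liner; the sketch below (with made-up numbers) shows the identifiability trick of subtracting the mean advantage:

```python
import numpy as np

def dueling_q(value, advantages):
    """Aggregate a dueling network's two streams:
    Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
    Centering A makes the V/A decomposition identifiable without
    changing which action is greedy."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

q = dueling_q(2.0, [1.0, 3.0, 2.0])
print(q)   # [1. 3. 2.]
```

Without the centering term, any constant could be shifted between V and A while producing the same Q-values, so the two streams would not learn their intended roles.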
Fast Perceptron Decision Tree Learning from Evolving Data Streams - Albert Bifet
The document proposes using perceptron learners at the leaves of Hoeffding decision trees to improve performance on data streams. It introduces a new evaluation metric called RAM-Hours that considers both time and memory usage. The authors empirically evaluate different classifier models, including Hoeffding trees with perceptron and naive Bayes learners at leaves, on several datasets. Results show that hybrid models like Hoeffding naive Bayes perceptron trees often provide the best balance of accuracy, time and memory usage.
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon University - MLconf
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
This document describes MIST, a system for large-scale IoT stream processing. MIST uses a cluster of machines to efficiently handle billions of IoT stream queries. It provides query APIs that allow users to define dataflow and complex event processing queries. MIST optimizes processing by sharing code, exploiting locality of code references through query grouping, and merging queries to reuse system resources.
Using CNTK's Python Interface for Deep Learning - Dave DeBarr, PyData
The document provides an overview of using CNTK (Cognitive Toolkit) for deep learning in Python. It discusses topics like machine learning, deep learning, neural networks, gradient descent, and examples using logistic regression and multi-layer perceptrons. It also covers installing CNTK and related tools on Azure virtual machines to access GPUs for faster computation. Key steps outlined are downloading required software, configuring the Nvidia driver, running examples notebooks, and the basic principles of backpropagation for training neural networks.
Safe and Efficient Off-Policy Reinforcement Learning - mooopan
This document summarizes the Retrace(λ) reinforcement learning algorithm presented by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare. Retrace(λ) is an off-policy multi-step reinforcement learning algorithm that is safe (converges for any policy), efficient (makes best use of samples when policies are close), and has lower variance than importance sampling. Empirical results on Atari 2600 games show Retrace(λ) outperforms one-step Q-learning and existing multi-step methods.
Logical clocks assign sequence numbers to distributed system events to determine causality without a global clock. Lamport's algorithm uses logical clocks to impose a partial ordering on events. Vector clocks extend this to also detect concurrent events that are not causally related, providing a full happened-before relation between all events. Each process maintains a vector clock that is incremented after local events and updated when receiving messages from other processes.
This paper proposes a multiple query optimization (MQO) scheme for change point detection (CPD) that can significantly reduce the number of operators needed. CPD is used to detect anomalies in time series data but requires tuning parameters, which leads to running multiple CPDs with different parameters. The paper identifies four patterns for sharing CPD operators between queries based on whether parameter values are the same. Experiments show the proposed MQO approach reduces the number of operators by up to 80% compared to running each CPD independently, thus improving performance. Integrating MQO with hardware accelerators is suggested as future work.
The Case for a Signal Oriented Data Stream Management SystemReza Rahimi
This document proposes a signal-oriented data stream management system called WaveScope. It discusses typical applications involving sensor networks, the data and programming model using a domain-specific language called WaveScript, and the system architecture involving query planning, optimization, and distributed execution. Key aspects include managing timing information across different timebases, optimizing queries using both database and signal processing techniques, and supporting archived historical data retrieval.
This document discusses scheduling in distributed systems. It covers:
1) Common scheduling techniques like min-min, max-min, and sufferage for scheduling independent tasks on dedicated systems.
2) Scheduling dependent tasks modeled as directed acyclic graphs (DAGs) using techniques like critical path on a processor (CPOP) and heterogeneous earliest finish time (HEFT).
3) The need for scheduling algorithms to adapt to dynamic grid environments where tasks may have dependencies on shared files and network transfer times vary.
This document describes fast single-pass k-means clustering algorithms. It discusses the rationale for using k-means clustering to enable fast search over large datasets. The document outlines ball k-means and surrogate clustering algorithms that can cluster data in a single pass. It discusses how these algorithms work and their implementation, including using locality sensitive hashing and projection searches to speed up clustering over high-dimensional data. Evaluation results show these algorithms can accurately cluster data much faster than traditional k-means approaches. The applications of these fast clustering algorithms include enabling fast nearest neighbor searches over large customer datasets for applications like marketing and fraud prevention.
This document provides an overview of deep deterministic policy gradient (DDPG), which combines aspects of DQN and policy gradient methods to enable deep reinforcement learning with continuous action spaces. It summarizes DQN and its limitations for continuous domains. It then explains policy gradient methods like REINFORCE, actor-critic, and deterministic policy gradient (DPG) that can handle continuous action spaces. DDPG adopts key elements of DQN like experience replay and target networks, and models the policy as a deterministic function like DPG, to apply deep reinforcement learning to complex continuous control tasks.
This document discusses logical clocks and vector clocks for capturing causality in distributed systems. It defines causal order and shows how Lamport logical clocks assign timestamps to events to preserve causal order, with timestamps providing a partial order. Vector clocks also preserve causal order and provide a partial order, but additionally guarantee that if one event causally precedes another, it will have a smaller vector timestamp.
Logical clocks are mechanisms for capturing chronological and causal relationships in distributed systems. Lamport introduced logical clocks using timestamps assigned to events to define a "happens before" relation between events. Vector clocks extend logical timestamps to capture causality more accurately by maintaining a vector of timestamps, with one entry per process. Matrix clocks further extend this idea by maintaining a matrix to represent processes' knowledge of other processes' logical clocks.
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...MLconf
Large-scale Machine Learning: Deep, Distributed and Multi-Dimensional:
Modern machine learning involves deep neural network architectures which yields state-of-art performance on multiple domains such as computer vision, natural language processing and speech recognition. As the data and models scale, it becomes necessary to have multiple processing units for both training and inference. Apache MXNet is an open-source framework developed for distributed deep learning. I will describe the underlying lightweight hierarchical parameter server architecture that results in high efficiency in distributed settings.
Pushing the current boundaries of deep learning requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. We present new deep learning architectures that preserve the multi-dimensional information in data end-to-end. We show that tensor contractions and regression layers are an effective replacement for fully connected layers in deep learning architectures. They result in significant space savings with negligible performance degradation. These functionalities are available in the Tensorly package with MXNet backend interface for large-scale efficient learning.
Bio: Anima Anandkumar is a principal scientist at Amazon Web Services and a Bren professor at Caltech CMS department. Her research interests are in the areas of large-scale machine learning, non-convex optimization and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. She is the recipient of several awards such as the Alfred. P. Sloan Fellowship, Microsoft Faculty Fellowship, Google research award, ARO and AFOSR Young Investigator Awards, NSF Career Award, Early Career Excellence in Research Award at UCI, Best Thesis Award from the ACM Sigmetrics society, IBM Fran Allen PhD fellowship, and several best paper awards. She has been featured in a number of forums such as the yourstory, Quora ML session, O’Reilly media, and so on. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a postdoctoral researcher at MIT from 2009 to 2010, an assistant professor at U.C. Irvine between 2010 and 2016, and a visiting researcher at Microsoft Research New England in 2012 and 2014.
Leveraging Bagging for Evolving Data Streams – Albert Bifet
The document presents new methods for leveraging bagging for evolving data streams. It discusses using randomization techniques like Poisson distributions for input data and random output codes to increase diversity among classifiers. Experimental results on data streams with concept drift show the proposed methods like Leveraging Bagging and Leveraging Bagging MC improve accuracy over baselines like Hoeffding Trees and Online Bagging, while methods like Leveraging Bagging ME reduce RAM-Hours usage. The paper aims to improve accuracy and resource usage for data stream mining under concept drift.
Classifying Multi-Variate Time Series at Scale:
Characterizing and understanding the runtime behavior of large-scale Big Data production systems is extremely important. Typical systems consist of hundreds to thousands of machines in a cluster with hundreds of terabytes of storage costing millions of dollars, solving problems that are business critical. By instrumenting each running process and measuring its resource utilization, including CPU, memory, I/O, network, etc., as time series, it is possible to understand and characterize the workload on these massive clusters. Each time series consists of tens to tens of thousands of data points that must be ingested and then classified. At Pepperdata, our instrumentation of the clusters collects over three hundred metrics from each task every five seconds, resulting in millions of data points per hour. At this scale the data are equivalent to the biggest IoT data sets in the world. Our objective is to classify the collection of time series into a set of classes that represent different workload types. Phrased differently, our problem is essentially the problem of classifying multivariate time series.
In this talk, we propose a unique, off-the-shelf approach to classifying time series that achieves near best-in-class accuracy for univariate series and generalizes to multivariate time series. Our technique maps each time series to a Gramian Angular Difference Field (GADF), interprets it as an image, uses Google’s pre-trained Inception v3 CNN to map the GADF images into a 2048-dimensional vector space, and then uses a small MLP with two hidden layers of fifty nodes each and a softmax output to achieve the final classification. Our work is not domain specific – a fact proven by our achieving competitive accuracies with published results on the univariate UCR data set as well as the multivariate UCI data set.
Bio: Before joining Pepperdata, Ash was executive chairman for Marianas Labs, a deep learning startup sold in December 2015. Prior to that he was CEO for Graphite Systems, a big data storage startup that was sold to EMC DSSD in August 2015. Munshi also served as CTO of Yahoo, as a CEO of both public and private companies, and is on the board of several technology startups.
Improving Numerical Wave Forecasts by Data Assimilation Based on Neural Networks – Aditya N Deshmukh
This document presents a study that uses artificial neural networks (ANNs) to improve numerical wave forecasts of significant wave height (Hs) and peak wave period (Tp) through data assimilation. The study uses ANNs to predict errors between numerical model outputs and buoy observations, then corrects the numerical forecasts by adding/subtracting the predicted errors. Results show the ANN-corrected numerical forecasts have improved agreement with observations compared to the original numerical forecasts or standalone ANN models, especially for 6-hour and 24-hour ahead predictions of Hs across different seasons.
Dueling network architectures for deep reinforcement learning – Taehoon Kim
1. The document proposes a dueling network architecture for deep reinforcement learning that separately estimates state value and state-dependent action advantages without extra supervision.
2. It introduces a dueling deep Q-network that uses a single network with two streams - one that produces a state value and the other that produces state-dependent action advantages, which are then combined to estimate the state-action value function.
3. Experiments on Atari games show that the dueling network outperforms traditional deep Q-networks, achieving better performance in both random starts and starts from human demonstrations.
Fast Perceptron Decision Tree Learning from Evolving Data Streams – Albert Bifet
The document proposes using perceptron learners at the leaves of Hoeffding decision trees to improve performance on data streams. It introduces a new evaluation metric called RAM-Hours that considers both time and memory usage. The authors empirically evaluate different classifier models, including Hoeffding trees with perceptron and naive Bayes learners at leaves, on several datasets. Results show that hybrid models like Hoeffding naive Bayes perceptron trees often provide the best balance of accuracy, time and memory usage.
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni… – MLconf
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
This document describes MIST, a system for large-scale IoT stream processing. MIST uses a cluster of machines to efficiently handle billions of IoT stream queries. It provides query APIs that allow users to define dataflow and complex event processing queries. MIST optimizes processing by sharing code, exploiting locality of code references through query grouping, and merging queries to reuse system resources.
Using CNTK's Python Interface for Deep Learning – Dave DeBarr, PyData
The document provides an overview of using CNTK (Cognitive Toolkit) for deep learning in Python. It discusses topics like machine learning, deep learning, neural networks, gradient descent, and examples using logistic regression and multi-layer perceptrons. It also covers installing CNTK and related tools on Azure virtual machines to access GPUs for faster computation. Key steps outlined are downloading the required software, configuring the Nvidia driver, running the example notebooks, and the basic principles of backpropagation for training neural networks.
Safe and Efficient Off-Policy Reinforcement Learning – mooopan
This document summarizes the Retrace(λ) reinforcement learning algorithm presented by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare. Retrace(λ) is an off-policy multi-step reinforcement learning algorithm that is safe (converges for any policy), efficient (makes best use of samples when policies are close), and has lower variance than importance sampling. Empirical results on Atari 2600 games show Retrace(λ) outperforms one-step Q-learning and existing multi-step methods.
Streaming Analytics: It's Not the Same Game – Numenta
This document discusses streaming analytics and how traditional machine learning algorithms are not well-suited for streaming data. It introduces Hierarchical Temporal Memory (HTM) as a new approach inspired by neuroscience that can handle streaming data, continuous learning, and temporal modeling. HTM uses sparse distributed representations and models sequences to make predictions and detect anomalies. The document provides examples of how HTM can be applied to problems like anomaly detection in server metrics, human behavior, geospatial tracking, social media streams, and stock prices. HTM algorithms are domain-independent and use the same codebase and parameters across different problem types.
Computing Just What You Need: Online Data Analysis and Reduction at Extreme … – Ian Foster
This document discusses computing challenges posed by rapidly increasing data scales in scientific applications and high performance computing. It introduces the concept of online data analysis and reduction as an alternative to traditional offline analysis to help address these challenges. The key messages are that dramatic changes in HPC system geography due to different growth rates of technologies are driving new application structures and computational logistics problems, presenting exciting new computer science opportunities in online data analysis and reduction.
This document presents the πRT-calculus, a calculus for modeling mobile real-time processes. It extends the π-calculus with a timeout operator to model real-time aspects. The document covers the syntax and semantics of the π-calculus and πRT-calculus. It also discusses design choices like having a global clock and discrete time. An example of a mobile video streaming system is used to illustrate the πRT-calculus. The document concludes by discussing future work, like developing timed bisimulation and extending to continuous time.
This document summarizes an integrated control system design tool that was developed to analyze and design intelligent control systems. The tool allows users to:
- Build interactive CAD models for control system design
- Incorporate various libraries and integrate popular AI techniques
- Simulate systems using mixed continuous/discrete event simulation
- Represent systems with generalized Petri net and fuzzy system models
- Program using a new matrix-based language
- Model dead time elements flexibly
- Apply the tool to examples like adaptive fuzzy PID control and liquid level control
The document outlines the goals, background, achievements, related tools, and plans for future work regarding the development of the integrated intelligent control system design tool.
Bridging the Gap: Machine Learning for Ubiquitous Computing -- ML and Ubicomp… – Thomas Ploetz
Tutorial @Ubicomp 2015: Bridging the Gap -- Machine Learning for Ubiquitous Computing (machine learning and ubicomp primer session).
A tutorial on promises and pitfalls of Machine Learning for Ubicomp (and Human Computer Interaction). From Practitioners for Practitioners.
Presenter: Thomas Ploetz <tom.ploetz@gmail.com>
video recording of talks as they were held at Ubicomp:
https://youtu.be/LgnnlqOIXJc?list=PLh96aGaacSgXw0MyktFqmgijLHN-aQvdq
Introduction to computing Processing and performance.pdf – TulasiramKandula1
This document discusses analyzing the performance of computer programs through empirical analysis and mathematical modeling. It provides an example of empirically analyzing the running time of a 3-sum problem algorithm by running experiments with increasing input sizes, measuring times, plotting the results, and fitting the data to a mathematical model. The analysis suggests the algorithm runs in O(N³) time. Doubling the input size and verifying the predicted running time supports the performance hypothesis.
Hardware Acceleration for Machine Learning – CastLabKAIST
This document provides an overview of a lecture on hardware acceleration for machine learning. The lecture will cover deep neural network models like convolutional neural networks and recurrent neural networks. It will also discuss various hardware accelerators developed for machine learning, including those designed for mobile/edge and cloud computing environments. The instructor's background and the agenda topics are also outlined.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yyaHb8.
The authors discuss Netflix's new stream processing system that supports a reactive programming model, allows auto scaling, and is capable of processing millions of messages per second. Filmed at qconsf.com.
Danny Yuan is an architect and software developer in Netflix’s Platform Engineering team. Justin Becker is Senior Software Engineer at Netflix.
This is a talk given by Badrish Chandramouli at Portland State University on May 30, 2017, and overviews his recent and ongoing research directions in the space of stream processing and big data analytics.
The study on mining temporal patterns and related applications in dynamic soc… – Thanh Hieu
The document provides a curriculum vitae for Yi-Cheng Chen that includes basic information, education history, and research interests. It notes that Chen received a B.S. from Yuan Ze University in 2000, an M.S. from National Taiwan University of Science and Technology in 2002, and a Ph.D. from National Chiao Tung University in 2012 under the advisement of Professors Suh-Yin Lee and Wen-Chih Peng. Chen's Ph.D. dissertation focused on time interval-based sequential pattern mining. The CV outlines Chen's current research interests as temporal pattern mining, social network analysis, smart home applications, and cloud computing.
This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.
Deep learning to the rescue - solving long-standing problems of recommender … – Balázs Hidasi
I gave this talk at the 1st Budapest RecSys and Personalization Meetup about using deep learning to solve long-standing problems of recommender systems. I also presented our approach of using RNNs for session-based recommendations in detail.
Real time intrusion detection in network traffic using adaptive and auto-scal… – Gobinath Loganathan
This document proposes an adaptive and auto-scaling stream processor called Wisdom to enable real-time intrusion detection in network traffic. Wisdom can dynamically optimize complex event processing (CEP) rules using hybrid optimization algorithms like particle swarm optimization and bisection. Tests show Wisdom can detect attacks like HTTP slow header denial of service and port scans with over 99.95% accuracy. Wisdom also allows functionally auto-scaling deployments of CEP rules to optimize resource usage.
Provenance for Data Munging Environments – Paul Groth
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
EMI2lets is a middleware platform that aims to facilitate the development of context-aware mobile applications for ambient intelligence spaces. It uses mobile devices as universal remote controllers of "smart objects" in the environment. The EMI2lets platform allows physical objects and devices to be augmented with computational services, and discovers and interacts with these smart objects. It transfers small software components called EMI2lets from smart objects to mobile devices, allowing users to interact with and control smart objects through their phone or PDA. This transforms the environment into an ambient intelligence space and mobile devices into intelligent assistants.
In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI to provide tools for autonomous cloud operations. This talk will explain how deep learning, distributed traces, and time-series analysis (sequence analysis) can be used to effectively detect anomalous cloud infrastructure behaviors during operations and reduce the workload of human operators. The iForesight system is being used to evaluate this new O&M approach. iForesight 2.0 is the result of 2 years of research with the goal of providing an intelligent new tool aimed at SRE cloud maintenance teams. It enables them to quickly detect and predict anomalies, thanks to the use of artificial intelligence, when cloud services are slow or unresponsive.
This document presents a framework for verifying the safety of classification decisions made by deep neural networks. It defines safety as the network producing the same output classification for an input and any perturbations of that input within a bounded region. The framework uses satisfiability modulo theories (SMT) to formally verify safety by attempting to find an adversarial perturbation that causes misclassification. It has been tested on several image classification networks and datasets. The framework provides a method to automatically verify safety properties of deep neural networks.
Similar to ODSC 2019: Sessionisation via stochastic periods for root event identification (20)
The Ipsos - AI - Monitor 2024 Report.pdf – Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai… – Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
End-to-end pipeline agility - Berlin Buzzwords 2024 – Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W… – Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You… – Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
2. Thales Overview
From the Bottom of the Oceans… to the Depths of
Space & Cyberspace
Key Digital Technologies
3. Thales: A Research and Development Powerhouse
6-time winner: 2012, 2013, 2015, 2016, 2017, 2018
Expertise in a uniquely broad range of technical domains, from science to systems, applied across businesses.
An extensive intellectual property portfolio of 20,500 patents.
Albert Fert: scientific director of the CNRS/Thales joint physics unit and winner of the 2007 Nobel prize in physics.
4. Agenda
• Motivation for studying events
• Concept & purpose of Sessionisation
• Traditional approaches
• Real world case studies
• Applied Data Science way of doing Sessionisation
5. Events
• Orders placed in a market
• Sequence of user tweets
• User’s clicks on a website
• Activity update by an IoT device
• Network events on a router
• Network alarms in a network
12. Sessions: Operations vs Data Science view
Continuity in activity
(Figure: sessions separated by gaps, where Gap >> Mean activity period)
13. Sessions: Operations vs Data Science view
Continuity in activity; a chain of time-sequenced events
(Figure: sessions separated by gaps, where Gap >> Mean activity period)
Time based correlation
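The split rule above (start a new session when the gap between events is much larger than the mean activity period) can be sketched in plain Python. This is a minimal illustration, not the talk's method: the 3x gap factor and the event timestamps are assumptions chosen for the example.

```python
from statistics import mean

def sessionise(timestamps, gap_factor=3.0):
    """Split a sorted list of event timestamps into sessions.

    A new session starts whenever the gap to the previous event
    exceeds gap_factor * (mean gap between consecutive events).
    """
    if len(timestamps) < 2:
        return [timestamps[:]]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    threshold = gap_factor * mean(deltas)
    sessions, current = [], [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > threshold:
            sessions.append(current)   # gap >> mean activity period
            current = []
        current.append(curr)
    sessions.append(current)
    return sessions

# Events at ~1 s intervals with two long gaps -> three sessions
print(sessionise([0, 1, 2, 3, 50, 51, 52, 120, 121]))
# [[0, 1, 2, 3], [50, 51, 52], [120, 121]]
```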
20. Malicious Actors in the world of AI
• Orders placed in a market: Market manipulation
• Sequence of user tweets : Bot campaigns
• User’s clicks on a website: Fraudulent transactions
• Activity by an IoT device: Taking device control
• Network events on a router: Cyber attacks
22. Approaches for finding time based patterns
• Fourier transform
• Time period – Stochastic periods
• GMM (Gaussian Mixture Models)
• Infinite GMM (Infinite Gaussian Mixture Models)
• Non-parametric Bayesian methods
• Applied data science techniques
(Figure: approaches plotted by information vs. complexity, ending at applied data science)
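The first approach in the list, the Fourier transform, can be sketched with NumPy: bin the events into a regularly sampled signal and read the dominant frequency off the spectrum. The signal here is synthetic (a noisy 8-second sinusoid), chosen only to illustrate the idea.

```python
import numpy as np

# A signal that repeats every ~8 seconds, sampled once per second
rng = np.random.default_rng(0)
t = np.arange(1024)
signal = np.sin(2 * np.pi * t / 8.0) + 0.2 * rng.normal(size=1024)

# FFT: the dominant frequency of the (mean-removed) signal
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(len(signal), d=1.0)  # cycles per second

dominant = freqs[np.argmax(spectrum)]
print(1.0 / dominant)  # the recovered period: 8.0 seconds
```

This works well for clean machine-generated signals; the talk's point is that it degrades when the period itself is noisy, which motivates the stochastic-period treatment below.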
30. Case studies via public datasets
• Sessionisation is an essential activity in detecting malicious bot activities like Beaconing
• We will use the 6th dataset of the CTU-13 datasets for examples
• Provided by the Czech Technical University (CTU)
• Traces captured from a malware attack executed in the university network
• The 6th dataset simulates a bot named DonBot; it attacks SVC services on Windows
• Dataset: https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-47/bro/conn.log
31. Case – 1: DonBot's DNS queries to the university DNS server
35. Stochastic periods: Introduction
• Analyze periodicity in the time domain
• Compute consecutive time deltas
• Real-world signals are noisy, so time deltas will vary a lot
• If there is periodicity in the signal, time deltas will vary within a band
• The density plot of time deltas will show some high-density regions
• We can learn a probability distribution for each high-density region
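The steps above can be sketched with NumPy. The event stream is synthetic (one beacon-like ~10 s period and one slower ~60 s period are assumptions for the example), and a histogram stands in for the density plot:

```python
import numpy as np

# Synthetic event stream: a noisy ~10 s period followed by a noisy ~60 s period
rng = np.random.default_rng(0)
gaps = np.concatenate([
    rng.normal(10.0, 0.5, 200),   # beacon-like activity
    rng.normal(60.0, 2.0, 50),    # slower periodic activity
])
t = np.cumsum(gaps)               # event timestamps

# Step 1: consecutive time deltas
deltas = np.diff(t)

# Step 2: density of the deltas (a histogram as a crude density plot)
counts, edges = np.histogram(deltas, bins=50)

# High-density regions show up as histogram peaks; the tallest bin
# approximates the expected value of the dominant stochastic period
print(edges[np.argmax(counts)])   # near 10 seconds
```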
58. Auto discovering multiple distributions
Expectation Maximization
If the sources are known, estimation is easy; but how do we estimate the parameter θ when they are not?
GMM – Gaussian Mixture Models
59. Auto discovering multiple distributions
Probabilistic model of data: Gaussian Mixture Model (GMM)
60. GMM – Gaussian Mixture Models
• Does soft clustering of data points instead of hard clustering
• In principle it is very similar to K-Means, but works on probabilities
• K-Means: {P1 → C1, P2 → C2}; GMM: {P1 → [0.8, 0.1, 0.1], P2 → [0.05, 0.85, 0.1]}
• Problem with GMM & K-Means: we need to define "K"
• Techniques like the Elbow method, Silhouette, etc. are based on certain assumptions
• They cannot be applied in general for automated discovery of K
• Finding "K" automatically is a very hard problem to solve
61. Bayesian way of building models
P(θ|X) = P(X|θ) · P(θ) / P(X)
Posterior = (Likelihood × Prior) / Evidence
P(θ) is conjugate to P(X|θ): the posterior A(ν′) keeps the same functional form as the prior A(ν)
For example:
P(θ) = 𝒩(θ | 0, 1)    # standard normal prior
P(x|θ) = 𝒩(x | θ, 1)   # likelihood with 1 std. dev.
P(θ|X) ∝ e^(−½(x−θ)²) · e^(−½θ²)
P(θ|X) ∝ e^(−(θ − x/2)²)
P(θ|X) = 𝒩(θ | x/2, 1/2)
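The slide's normal-normal example has a closed-form update, so it can be checked numerically. This is a small sketch of the standard conjugate update for a Normal prior and Normal likelihood with known variance; the observation x = 3.0 is an arbitrary example value:

```python
def normal_posterior(prior_mean, prior_var, obs, obs_var):
    """Conjugate update: Normal prior + Normal likelihood (known variance)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

# Slide example: prior N(0, 1), one observation x with unit variance
mean, var = normal_posterior(0.0, 1.0, obs=3.0, obs_var=1.0)
print(mean, var)  # 1.5 0.5, i.e. the posterior N(x/2, 1/2)
```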
66. Sessionisation: Data Science at scale
• In a real-world scenario (web users over the internet, network hosts in an enterprise network, etc.) one would need to apply Sessionisation to millions of entities
• So manual inspection based methods cannot be used
• We need a fully automated system to discover multiple "Stochastic Periods"
• We need to find the clusters automatically
72. Intuition behind Infinite GMM
Properties of the Dirichlet distribution
The Dirichlet distribution is conjugate to the Multinomial distribution
If π = (π₁, π₂, …, π_k) ~ Dirichlet(α₁, α₂, …, α_k)
then the Dirichlet satisfies the expansion (combination) rule:
(π₁θ, π₁(1 − θ), π₂, …, π_k) ~ Dirichlet(α₁b, α₁(1 − b), α₂, …, α_k)
where 0 < b < 1 and θ ~ Beta(α₁b, α₁(1 − b))
This allows the dimensionality of the Dirichlet to be increased
73. Intuition behind Infinite GMM
Properties of the Dirichlet distribution
The Dirichlet distribution is conjugate to the Multinomial distribution
If π = (π₁, π₂, …, π_k) ~ Dirichlet(α₁, α₂, …, α_k)
then the Dirichlet satisfies the expansion (combination) rule:
(π₁θ, π₁(1 − θ), π₂, …, π_k) ~ Dirichlet(α₁b, α₁(1 − b), α₂, …, α_k)
where 0 < b < 1 and θ ~ Beta(α₁b, α₁(1 − b))
Dirichlet Process: repeatedly expanding
π⁽²⁾ = (π₁⁽²⁾, π₂⁽²⁾) ~ Dir(α/2, α/2) → Dir(α/4, α/4, α/4, α/4) → … → Dir(α/K, …, α/K), with K → ∞
74. Dirichlet process
Chinese restaurant process | Indian buffet process
Indian Buffet Process: the nth customer helps himself to each previously tried dish k with probability m_k / n, where m_k is the number of times dish k was chosen, and then tries Poisson(α/n) new dishes. (Figure: an example IBP dish assignment and the binary matrix it generates.)
Chinese restaurant process in action
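The Chinese restaurant process can be simulated directly: customer n + 1 joins an existing table with probability proportional to its size, or opens a new table with probability proportional to α. This is a generic CRP sketch (n = 1000 and α = 2.0 are arbitrary example values), showing how the number of clusters grows without fixing K in advance:

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample a partition of customers into tables via the CRP."""
    random.seed(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(n_customers):
        # A new table is opened with probability alpha / (n + alpha)
        if random.random() < alpha / (n + alpha):
            tables.append(1)
        else:
            # Otherwise join table k with probability tables[k] / (n + alpha)
            r = random.uniform(0, sum(tables))
            acc = 0
            for k, size in enumerate(tables):
                acc += size
                if r <= acc:
                    tables[k] += 1
                    break
    return tables

tables = chinese_restaurant_process(1000, alpha=2.0)
print(len(tables), sum(tables))  # table count grows ~ alpha * log(n); total = 1000
```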
77. Probabilistic modeling
• Probabilistic models capture the uncertainty in real-world data better
• But they are very computationally intensive
• The sampling process takes time to stabilize before it generates meaningful results
• So they certainly cannot work on large datasets
82. Obtaining stochastic periods recursively
• Get dense regions list: find dmin to cluster a region
• Recursively split regions if a region is heavy tailed or multi-modal
• Stochastic periods: get probability distributions from the dense regions
Tests used for splitting:
Kurtosis of the normal distribution = 3
Heavy tailed: excess kurtosis > 6
Bimodality = (γ² + 1) / κ, where γ is skewness and κ is kurtosis
Bimodality of the uniform distribution = 5/9
Bimodality > 0.8 ⇒ not unimodal, otherwise unimodal
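The two split tests above map directly onto scipy.stats (assumed available). Thresholds follow the slide; the uniform and two-mode samples are synthetic sanity checks:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def is_heavy_tailed(x, excess_kurtosis_threshold=6.0):
    # A normal distribution has kurtosis 3, i.e. excess kurtosis 0
    return kurtosis(x, fisher=True) > excess_kurtosis_threshold

def bimodality_coefficient(x):
    g = skew(x)                    # gamma: skewness
    k = kurtosis(x, fisher=False)  # kappa: plain (non-excess) kurtosis
    return (g ** 2 + 1) / k

rng = np.random.default_rng(0)
uniform = rng.uniform(0, 1, 100_000)
print(round(bimodality_coefficient(uniform), 2))  # ~ 5/9, i.e. about 0.56

# A well-separated two-mode mixture scores above the 0.8 threshold
bimodal = np.concatenate([rng.normal(-5, 1, 50_000), rng.normal(5, 1, 50_000)])
print(bimodality_coefficient(bimodal) > 0.8)
```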
83. dmin via distance matrix
This topic was covered generically in detail during ODSC 2018 – "Topological space clustering"
84. dmin via distance matrix
85. dmin via distance matrix
If local dense regions exist along with sparsity, then we can obtain hierarchical clusters at each mode
87. Method proposed: finding the optimal clustering epsilon
• The problem comes down to finding the optimal curve for the Gaussian kernel
• One way to solve it algorithmically:
Grid Search (band_width, grid_size) → rFFT → Silverman Transform → I-rFFT → Score (logLoss, stdDev) → Minima (band_width, grid_size)
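As a much simpler stand-in for the grid search above (not the talk's pipeline), Silverman's rule of thumb gives a closed-form starting bandwidth, and scipy's gaussian_kde ships with a Silverman option. The two-period delta sample is synthetic:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
deltas = np.concatenate([rng.normal(10, 0.5, 500),
                         rng.normal(60, 2.0, 500)])

# Silverman's rule of thumb: h = 0.9 * min(std, IQR / 1.34) * n**(-1/5)
n = len(deltas)
iqr = np.subtract(*np.percentile(deltas, [75, 25]))
h = 0.9 * min(deltas.std(ddof=1), iqr / 1.34) * n ** (-0.2)
print(round(h, 2))

# scipy's KDE with its Silverman option; the resulting density still
# shows clear modes near 10 and 60, with a deep valley in between
kde = gaussian_kde(deltas, bw_method="silverman")
grid = np.linspace(0, 80, 801)
density = kde(grid)
```

Note how the rule-of-thumb bandwidth is driven by the overall spread of the mixture and therefore oversmooths the tight 10-second mode; this is precisely why a scored search over (band_width, grid_size) is worth doing.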
88. Genuine systematic DNS queries to a DNS server
(Figure: time delta analysis and time delta density plot)
92. Finite GMM: Bayesian setting
Algorithm: collapsed Gibbs sampler for a finite Gaussian mixture model
Choose an initial z
For T iterations do                                # Gibbs sampling iterations
  For i = 1 to N do
    Remove x_i's statistics from component z_i     # old cluster assignment for x_i
    For k = 1 to K do                              # every possible component
      Calculate P(z_i = k | z_{-i}, α)
      Calculate p(x_i | X_{-i,k}, β)
      Calculate P(z_i = k | z_{-i}, X, α, β) ∝ P(z_i = k | z_{-i}, α) · p(x_i | X_{-i,k}, β)
    End for
    Sample k_new from P(z_i | z_{-i}, X, α, β) after normalizing
    Add x_i's statistics to component z_i = k_new  # new assignment for x_i
  End for
End for
Evaluation metric for Gibbs: ∏_{k=1}^{K} p(X_k | β) · p(z | α)
93. Infinite GMM: Bayesian setting
Choose an initial z
For T iterations do                                # Gibbs sampling iterations
  For i = 1 to N do
    Remove x_i's statistics from component z_i     # old cluster assignment for x_i
    For k = 1 to K do                              # every existing component
      Calculate P(z_i = k | z_{-i}, α)
      Calculate p(x_i | X_{-i,k}, β)
      Calculate P(z_i = k | z_{-i}, X, α, β) ∝ P(z_i = k | z_{-i}, α) · p(x_i | X_{-i,k}, β)
    End for
    Calculate P(z_i = k* | z_{-i}, α)              # consider a new component
    Calculate p(x_i | β)
    Calculate P(z_i = k* | z_{-i}, X, α, β) ∝ P(z_i = k* | z_{-i}, α) · p(x_i | β)
    Sample k_new from P(z_i | z_{-i}, X, α, β) after normalizing
    If any component is empty, remove it and decrease K
    Add x_i's statistics to component z_i = k_new  # new assignment for x_i
  End for
End for
A global company: 80,000 employees in 68 countries
Heavy investments in innovation every year to develop state-of-the-art technologies: €1Bn invested in self-funded R&D