This document discusses mining frequent closed trees from XML data streams. It presents three algorithms for mining closed trees incrementally, with sliding windows, and adaptively using ADWIN to monitor change. Experimental results on real datasets show the adaptive approach using ADWIN achieves good accuracy while using limited memory.
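The adaptive approach monitors the stream with ADWIN, which keeps a window of recent items and drops the older part whenever two sub-windows show significantly different averages. As a rough illustration (a simplified sketch only; the real ADWIN uses exponential buckets to achieve logarithmic memory), the cut test can be written as:

```python
import math
from collections import deque

def adwin_step(window, x, delta=0.01):
    """Append x, then shrink the window while some split point shows a
    statistically significant difference between the means of the older
    and newer parts. A simplified sketch of the ADWIN idea, not the
    exact bucket-based algorithm."""
    window.append(x)
    changed = True
    while changed and len(window) > 2:
        changed = False
        n = len(window)
        total = sum(window)
        left_sum = 0.0
        for i in range(1, n):
            left_sum += window[i - 1]
            n0, n1 = i, n - i
            mu0, mu1 = left_sum / n0, (total - left_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic mean of the two sizes
            eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * n / delta))
            if abs(mu0 - mu1) > eps:
                window.popleft()  # change detected: drop the oldest element
                changed = True
                break
    return window
```

Feeding a stream whose mean jumps from 0 to 1 makes the window shed almost all pre-change elements, which is the behaviour the adaptive miner relies on.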
MOA is a framework for online learning from data streams. It is closely related to WEKA and includes learning algorithms such as boosting, bagging, and Hoeffding Trees, together with tools for evaluation. MOA deals with evolving data streams and is easy to use and extend.
Leveraging Bagging for Evolving Data Streams - Albert Bifet
The document presents new methods for leveraging bagging for evolving data streams. It discusses using randomization techniques like Poisson distributions for input data and random output codes to increase diversity among classifiers. Experimental results on data streams with concept drift show the proposed methods like Leveraging Bagging and Leveraging Bagging MC improve accuracy over baselines like Hoeffding Trees and Online Bagging, while methods like Leveraging Bagging ME reduce RAM-Hours usage. The paper aims to improve accuracy and resource usage for data stream mining under concept drift.
This document provides information on Ebstein's anomaly, including its anatomy, embryology, clinical presentation, diagnosis, and natural history. Some key points:
- Ebstein's anomaly is a congenital defect involving downward displacement of the tricuspid valve into the right ventricle. This can cause dilation of the right atrium and dysfunction of the right ventricle.
- Clinical presentation varies from neonatal congestive heart failure to later cyanosis, arrhythmias, and right heart failure in adults. Associated defects are common.
- Diagnosis is made through echocardiogram demonstrating displacement of the tricuspid valve leaflets. Other tests like ECG, chest x-ray, and
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra - Natalino Busa
We present a solution for streaming anomaly detection, named “Coral”, based on Spark, Akka and Cassandra. In the system presented, we use Spark to run the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up-to-date and that the number of false positives is kept low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Once the model is trained by Spark, its parameters are pushed to the streaming event processing layer, implemented in Akka. The Akka layer then scores thousands of events per second according to the latest model provided by Spark. Spark and Akka communicate with each other using Cassandra as a low-latency data store. By doing so, we make sure that every element of this solution is resilient and distributed. Spark performs micro-batches to keep the model up-to-date, while Akka detects new anomalies using the latest Spark-generated data model. The project is currently hosted on GitHub. Have a look at: http://coral-streaming.github.io
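The division of labour described above (Spark fits the clustering model, Akka scores events against its published parameters) can be illustrated with a toy scorer. The function names and threshold below are our own illustrative choices, not Coral's API:

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from a point to its closest cluster centre."""
    return min(math.dist(point, c) for c in centroids)

def score_event(point, centroids, threshold):
    """Flag an event as anomalous when it is far from every cluster centre.
    The centroids stand in for the model parameters Spark would publish;
    this function stands in for the Akka scoring layer."""
    return nearest_centroid_distance(point, centroids) > threshold
```

Because scoring only needs the (small) list of centroids and a threshold, it can run at high event rates while the heavier model fitting happens elsewhere, which is the point of the Spark/Akka split.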
Artificial intelligence and data stream mining - Albert Bifet
Big Data and Artificial Intelligence have the potential to
fundamentally shift the way we interact with our surroundings. The
challenge of deriving insights from data streams has been recognized
as one of the most exciting and key opportunities for both academia
and industry. Advanced analysis of big data streams from sensors and
devices is bound to become a key area of artificial intelligence
research as the number of applications requiring such processing
increases. Dealing with the evolution over time of such data streams,
i.e., with concepts that drift or change completely, is one of the
core issues in stream mining. In this talk, I will present an overview
of data stream mining, industrial applications, open source tools, and
current challenges of data stream mining.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
Mining Big Data Streams with APACHE SAMOA - Albert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, as well as on several other distributed stream processing
engines such as Storm and Samza.
Efficient Online Evaluation of Big Data Stream Classifiers - Albert Bifet
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
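The prequential setting mentioned above interleaves testing and training: each arriving instance is first used to evaluate the current model and then to update it. A minimal sketch, with an illustrative majority-class learner rather than any method from the paper:

```python
class MajorityClass:
    """Toy incremental learner: always predicts the most frequent label seen."""
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def prequential_accuracy(model, stream):
    """Interleaved test-then-train: test on each instance before training on it."""
    correct = 0
    i = 0
    for i, (x, y) in enumerate(stream, start=1):
        if model.predict(x) == y:
            correct += 1
        model.learn(x, y)
    return correct / i
```

As the abstract notes, this setting builds a single model over one pass, which is exactly why assessing statistical significance across multiple models requires the extra machinery the paper proposes.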
Apache Samoa: Mining Big Data Streams with Apache Flink - Albert Bifet
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries, and frameworks.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection, and recommendations.
3) A key challenge addressed by SAMOA is how to perform distributed stream mining on high-volume, high-velocity data streams at low latency using approaches like Apache Flink that can scale to handle large, fast data.
Data science involves extracting insights from large volumes of data. It is an interdisciplinary field that uses techniques from statistics, machine learning, and other domains. The document provides examples of classification algorithms like k-nearest neighbors, naive Bayes, and perceptrons that are commonly used in data science to build models for tasks like spam filtering or sentiment analysis. It also discusses clustering, frequent pattern mining, and other machine learning concepts.
This document provides an introduction to big data and MapReduce frameworks. It discusses:
- What big data is and examples of large datasets.
- An overview of MapReduce, including how it allows programmers to break problems into parallelizable map and reduce tasks.
- Details of how MapReduce frameworks like Apache Hadoop work, including distributed processing, fault tolerance, and the roles of mappers, reducers, and other components.
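The map/shuffle/reduce split can be sketched with the canonical word-count example. This is a single-process illustration of the programming model only, not how Hadoop itself is invoked:

```python
from collections import defaultdict

def map_phase(document):
    # mapper: emit a (word, 1) pair for every word in its input split
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # group intermediate pairs by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reducer: sum the counts for each word
    return {word: sum(vals) for word, vals in groups.items()}

docs = ["big data big streams", "data streams"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
```

Because each mapper touches only its own split and each reducer only its own keys, the framework can run the phases on many machines in parallel and rerun failed tasks for fault tolerance.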
The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
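The Hoeffding bound used by Hoeffding trees states that, with probability at least 1 - δ, the empirical mean of n observations of a variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean; a split attribute is chosen once the observed gain difference between the two best candidates exceeds ε. For example:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon for a variable with the given range,
    confidence parameter delta, and n observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))
```

With information gain in [0, 1], δ = 1e-7, and 1000 examples at a leaf, ε is roughly 0.09, so the tree splits as soon as the best attribute's gain beats the runner-up's by about that margin; with more examples, ε shrinks and finer distinctions become decidable.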
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
Multi-label Classification with Meta-labels - Albert Bifet
The area of multi-label classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach suffers from class imbalance and a restricted hypothesis space that negatively affects its predictive performance, and can easily be outperformed by methods which learn labels together. A number of methods have grown around the label powerset approach, which models label combinations together as class values in a multi-class problem. We describe the label-powerset-based solutions under a general framework of meta-labels. We provide theoretical justification for this framework which has been lacking, by viewing meta-labels as a hidden layer in an artificial neural network. We explain how meta-labels essentially allow a random projection into a space where non-linearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multi-label problems. Indeed, we present a novel model in the framework and evaluate it empirically against several high-performing methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. Our deployment of an ensemble of meta-label classifiers obtains competitive accuracy for a fraction of the computation required by the current meta-label methods for multi-label classification.
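The label powerset idea that meta-labels generalise can be shown in a few lines: each distinct label combination becomes one class value of an ordinary multi-class problem. A minimal sketch with illustrative names, not the paper's implementation:

```python
def label_powerset_fit(label_sets):
    """Map each distinct label combination to one multi-class value:
    the label powerset transformation."""
    classes = {}   # frozenset of labels -> class id
    encoded = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in classes:
            classes[key] = len(classes)
        encoded.append(classes[key])
    return encoded, classes
```

The transformation captures label dependencies for free, but the number of classes can grow with the number of observed combinations, which is the scalability pressure the meta-label framework addresses.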
Pitfalls in benchmarking data stream classification and how to avoid them - Albert Bifet
This document discusses pitfalls in benchmarking data stream classification and proposes ways to avoid them. It analyzes the electricity market dataset, a popular benchmark, and finds that it exhibits temporal dependence that favors classifiers that simply predict the previous value. It introduces new evaluation metrics like kappa plus that account for temporal dependence by comparing to a "no change" classifier. It also proposes a temporally aware classifier called SWT that incorporates previous labels into its predictions. Experiments on electricity and forest cover datasets show SWT and the new metrics better capture classifier performance on temporally dependent streaming data.
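The "no change" baseline and the kappa-plus-style correction can be sketched as follows. The formula here applies the usual kappa form with the baseline's accuracy as the chance term, which is our reading of the metric rather than a verbatim reproduction of the paper's definition:

```python
def no_change_accuracy(labels):
    """Accuracy of the 'no change' baseline that predicts the previous label."""
    correct = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return correct / (len(labels) - 1)

def kappa_plus(p0, p_no_change):
    """Normalise a classifier's accuracy p0 against the no-change baseline:
    1 means perfect, 0 means no better than predicting the previous label."""
    return (p0 - p_no_change) / (1 - p_no_change)
```

On a temporally dependent stream where labels rarely change, the baseline accuracy is high, so a classifier with impressive raw accuracy can still score near zero here, which is exactly the pitfall the paper exposes in the electricity dataset.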
STRIP: Stream Learning of Influence Probabilities - Albert Bifet
This document presents a method called STRIP (Streaming Learning of Influence Probabilities) for learning influence probabilities between users in a social network from a streaming log of propagations. It describes three solutions: (1) storing the whole social graph in memory, (2) using min-wise independent hashing to estimate probabilities while using sublinear space, and (3) estimating probabilities only for the most active users to be more space efficient. Experimental results on a Twitter dataset showed these solutions provided good approximations while using reasonable memory and processing time.
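The min-wise hashing ingredient of solution (2) can be illustrated with a small Jaccard estimator: two sets have the same minimum under a random hash function with probability equal to their Jaccard similarity. This sketch uses Python's built-in tuple hashing as a stand-in for a proper min-wise independent family, and is an illustration of the technique rather than STRIP itself:

```python
def minhash_signature(items, hash_seeds):
    """One minimum per seeded hash function; comparing minima across two
    sets estimates their Jaccard similarity in sublinear space."""
    return [min(hash((seed, x)) for x in items) for seed in hash_seeds]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two sets share a minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Storing a short signature per user instead of full propagation logs is what lets this style of sketch fit the sublinear-space budget the paper targets.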
Efficient Data Stream Classification via Probabilistic Adaptive Windows - Albert Bifet
This document discusses efficient data stream classification using probabilistic adaptive windows. It introduces the concept of data streams which have potentially infinite sequences of high-speed data that must be processed in real-time with limited memory. It then describes the probabilistic approximate window (PAW) algorithm, which maintains a sample of data instances in logarithmic memory by giving greater weight to newer instances. The document evaluates several data stream classification methods on real and synthetic data streams and finds that k-nearest neighbors with PAW has higher accuracy and lower memory usage than other methods.
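A much-simplified stand-in for PAW's sampling step: every stored instance survives each arrival with a fixed probability q, so older instances are exponentially less likely to remain and the expected window size stays bounded near 1/(1-q). This illustrates the biased-retention idea only; it is not the exact PAW algorithm:

```python
import random

def paw_update(window, x, q=0.98, rng=random):
    """Keep each stored instance with probability q, then add the new one.
    Recency bias: an instance that arrived a steps ago survives with
    probability q**a, so the sample leans heavily toward new data."""
    survivors = [item for item in window if rng.random() < q]
    survivors.append(x)
    return survivors
```

A k-nearest-neighbour classifier run over such a sample sees a small, recency-weighted slice of the stream, which is how the paper obtains high accuracy with low memory.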
Big Data is a new term used to identify datasets that we cannot manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity of such data.
Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. We are creating the same quantity of data every two days, as we created from the dawn of time up until 2003. Evolving data streams methods are becoming a low-cost, green methodology for real time online prediction and analysis. We discuss the current and future trends of mining evolving data streams, and the challenges that the field will have to overcome during the next years.
Mining Frequent Closed Graphs on Evolving Data Streams - Albert Bifet
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
The document outlines a tutorial on handling concept drift in machine learning. It discusses the challenges of concept drift when applying supervised learning algorithms to streaming data where the underlying data distribution changes over time. The tutorial aims to provide an integrated view of adaptive learning methods and how they can handle concept drift. It covers topics such as the problem of concept drift, techniques for handling drift, evaluating adaptive learning approaches, and applications that experience concept drift.
Fast Perceptron Decision Tree Learning from Evolving Data Streams - Albert Bifet
The document proposes using perceptron learners at the leaves of Hoeffding decision trees to improve performance on data streams. It introduces a new evaluation metric called RAM-Hours that considers both time and memory usage. The authors empirically evaluate different classifier models, including Hoeffding trees with perceptron and naive Bayes learners at leaves, on several datasets. Results show that hybrid models like Hoeffding naive Bayes perceptron trees often provide the best balance of accuracy, time and memory usage.
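One RAM-Hour is one gigabyte of RAM deployed for one hour, so the metric is simply the product of the two resources:

```python
def ram_hours(memory_gb, runtime_hours):
    """RAM-Hours cost of a run: GB of RAM multiplied by hours of runtime.
    Lets accuracy be traded off against combined time and memory cost."""
    return memory_gb * runtime_hours
```

For example, a classifier holding 0.5 GB for 2 hours costs 1 RAM-Hour, the same as one holding 2 GB for 30 minutes, which is what makes the metric useful for comparing methods with different time/memory profiles.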
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos (Adaptive Methods for Data Mining and Learning from Data Streams) - Albert Bifet
This document discusses methods for mining data streams, which are potentially infinite sequences of data that change over time. It describes using the ADWIN algorithm, which is an adaptive sliding window technique without parameters, to extract information from data streams using few resources. It also covers mining massive data, where the amount of digital information created now exceeds available storage. Algorithmic efficiency is important for green computing approaches to efficiently using computing resources. The document provides an example of finding a missing number in an increasing sequence and using random sampling to find a number in the upper half of a sorted list using sublinear space and time.
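The missing-number example works in sublinear space with a single counter: the missing value is the closed-form total of 1..n minus the running sum of the stream, so nothing needs to be stored:

```python
def missing_number(stream, n):
    """Find the one missing value in a stream of the numbers 1..n,
    keeping only a running sum rather than the stream itself."""
    expected = n * (n + 1) // 2  # closed-form sum of 1..n
    return expected - sum(stream)
```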
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe - Paige Cruz
Monitoring and observability aren’t traditionally found in software curricula, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on:
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, but also with other several distributed stream processing
engines such as Storm and Samza.
Efficient Online Evaluation of Big Data Stream ClassifiersAlbert Bifet
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries, and frameworks.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection, and recommendations.
3) A key challenge addressed by SAMOA is how to perform distributed stream mining on high-volume, high-velocity data streams at low latency using approaches like Apache Flink that can scale to handle large, fast data.
Data science involves extracting insights from large volumes of data. It is an interdisciplinary field that uses techniques from statistics, machine learning, and other domains. The document provides examples of classification algorithms like k-nearest neighbors, naive Bayes, and perceptrons that are commonly used in data science to build models for tasks like spam filtering or sentiment analysis. It also discusses clustering, frequent pattern mining, and other machine learning concepts.
This document provides an introduction to big data and MapReduce frameworks. It discusses:
- What big data is and examples of large datasets.
- An overview of MapReduce, including how it allows programmers to break problems into parallelizable map and reduce tasks.
- Details of how MapReduce frameworks like Apache Hadoop work, including distributed processing, fault tolerance, and the roles of mappers, reducers, and other components.
The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
Multi-label Classification with Meta-labelsAlbert Bifet
The area of multi-label classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach suffers from class imbalance and a restricted hypothesis space that negatively affects its predictive performance, and can easily be outperformed by methods which learn labels together. A number of methods have grown around the label powerset approach, which models label combinations together as class values in a multi-class problem. We describe the label-powerset-based solutions under a general framework of \emph{meta-labels}. We provide theoretical justification for this framework which has been lacking, by viewing meta-labels as a hidden layer in an artificial neural network. We explain how meta-labels essentially allow a random projection into a space where non-linearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multi-label problems. Indeed, we present a novel model in the framework and evaluate it empirically against several high-performing methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. Our deployment of an ensemble of meta-label classifiers obtains competitive accuracy for a fraction of the computation required by the current meta-label methods for multi-label classification.
Pitfalls in benchmarking data stream classification and how to avoid themAlbert Bifet
This document discusses pitfalls in benchmarking data stream classification and proposes ways to avoid them. It analyzes the electricity market dataset, a popular benchmark, and finds that it exhibits temporal dependence that favors classifiers that simply predict the previous value. It introduces new evaluation metrics like kappa plus that account for temporal dependence by comparing to a "no change" classifier. It also proposes a temporally aware classifier called SWT that incorporates previous labels into its predictions. Experiments on electricity and forest cover datasets show SWT and the new metrics better capture classifier performance on temporally dependent streaming data.
STRIP: stream learning of influence probabilities.Albert Bifet
This document presents a method called STRIP (Streaming Learning of Influence Probabilities) for learning influence probabilities between users in a social network from a streaming log of propagations. It describes three solutions: (1) storing the whole social graph in memory, (2) using min-wise independent hashing to estimate probabilities while using sublinear space, and (3) estimating probabilities only for the most active users to be more space efficient. Experimental results on a Twitter dataset showed these solutions provided good approximations while using reasonable memory and processing time.
Efficient Data Stream Classification via Probabilistic Adaptive WindowsAlbert Bifet
This document discusses efficient data stream classification using probabilistic adaptive windows. It introduces the concept of data streams which have potentially infinite sequences of high-speed data that must be processed in real-time with limited memory. It then describes the probabilistic approximate window (PAW) algorithm, which maintains a sample of data instances in logarithmic memory by giving greater weight to newer instances. The document evaluates several data stream classification methods on real and synthetic data streams and finds that k-nearest neighbors with PAW has higher accuracy and lower memory usage than other methods.
Adaptive XML Tree Mining on Evolving Data Streams
1. Adaptive XML Tree Mining on Evolving Data Streams
Albert Bifet
Laboratory for Relational Algorithmics, Complexity and Learning (LARCA)
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Porto, 21 May 2009
2. Mining Evolving Massive Structured Data
The basic problem: finding interesting structure on data
Mining massive data
Mining time-varying data
Mining in real time
Mining XML data
(Image: Salvador Dalí, The Disintegration of the Persistence of Memory, 1952-54)
3. XML Tree Classification on evolving data streams
Figure: a dataset example (four trees over node labels A-D, labeled alternately CLASS 1 and CLASS 2)
4. Tree Pattern Mining
Given a dataset of trees, find the complete set of frequent subtrees.
Frequent Tree Pattern (FT): includes all the trees whose support is no less than min_sup.
Closed Frequent Tree Pattern (CT): includes no tree which has a super-tree with the same support.
CT ⊆ FT
"Trees are sanctuaries. Whoever knows how to listen to them, can learn the truth." (Herman Hesse)
5. Mining Closed Frequent Trees
Our trees are: labeled and unlabeled, ordered and unordered.
Our subtrees are: induced, top-down.
Note that two different ordered trees can be the same unordered tree.
6. A tale of two trees
Consider D = {A, B}, where A and B are the two example trees (drawn in the slides), and let min_sup = 2. One slide lists the frequent subtrees of D; the next lists its closed subtrees.
8. XML Tree Classification on evolving data streams
Figure: the dataset example again (four trees over node labels A-D, labeled alternately CLASS 1 and CLASS 2)
9. XML Tree Classification on evolving data streams
Table: closed frequent trees c1 and c2 (drawn in the slide, together with frequent trees that are not closed) and their occurrences over transactions 1-4:
c1: 1 0 1 0
c2: 1 0 0 1
10. XML Tree Classification on evolving data streams
(The slide first shows the frequent trees c1-c4 with their occurrence vectors; the tree drawings themselves are omitted here.)
Closed and maximal trees per document, with class labels:

Id   Closed c1 c2 c3 c4   Maximal c1 c2 c3   Class
1           1  1  0  1            1  1  0    CLASS 1
2           0  0  1  1            0  0  1    CLASS 2
3           1  0  1  1            1  0  1    CLASS 1
4           0  1  1  1            0  1  1    CLASS 2
11. XML Tree Framework on evolving data streams
XML Tree Classification Framework Components:
An XML closed frequent tree miner
A data stream classifier algorithm, which we feed with tuples to be classified online
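The two components can be wired together in a few lines. The sketch below is purely illustrative, not the paper's implementation: each incoming tree is encoded as a binary tuple over the mined closed patterns, and the tuple is fed to an online learner (a plain perceptron here; the names encode and OnlinePerceptron are ours).

```python
def encode(patterns, tree, contains):
    """Binary feature vector: which mined patterns occur in this tree."""
    return tuple(1 if contains(p, tree) else 0 for p in patterns)

class OnlinePerceptron:
    """Minimal online perceptron over binary feature vectors."""
    def __init__(self, n_features, lr=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if s >= 0 else -1

    def learn(self, x, y):
        # mistake-driven update; y must be in {-1, +1}
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y
```

Any containment predicate can be plugged in as `contains`; in the framework it would be the subtree test supplied by the miner.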
12. Mining Evolving Tree Data Streams
Problem
Given a data stream D of rooted and unordered trees, find frequent closed trees.
We provide three algorithms, of increasing power:
Incremental
Sliding Window
Adaptive
13. Mining Closed Unordered Subtrees
(built up incrementally over three slides; the complete procedure is:)

CLOSED_SUBTREES(t, D, min_sup, T)
 1  if not CANONICAL_REPRESENTATIVE(t)
 2      then return T
 3  for every t' that can be extended from t in one step
 4      do if Support(t') ≥ min_sup
 5          then T ← CLOSED_SUBTREES(t', D, min_sup, T)
 6      do if Support(t') = Support(t)
 7          then t is not closed
 8  if t is closed
 9      then insert t into T
10  return T
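As an illustration only (not the authors' TreeNat code), the recursion above can be sketched in Python for unlabeled, unordered, top-down subtrees. A tree is encoded as a tuple of child subtrees, a leaf as the empty tuple; the helpers canonical, contains, support, and extensions are our own naming, and the exhaustive matching makes this a toy, not an efficient miner.

```python
def canonical(t):
    # canonical representative of an unordered tree: sort child encodings
    return tuple(sorted(canonical(c) for c in t))

def contains(pattern, tree):
    # top-down containment: roots map to roots, pattern children map
    # injectively to distinct tree children (backtracking matching)
    if len(pattern) > len(tree):
        return False
    if not pattern:
        return True
    def match(pi, used):
        if pi == len(pattern):
            return True
        for ti, child in enumerate(tree):
            if ti not in used and contains(pattern[pi], child):
                if match(pi + 1, used | {ti}):
                    return True
        return False
    return match(0, frozenset())

def support(t, D):
    return sum(contains(t, d) for d in D)

def extensions(t):
    # all trees obtained by attaching one new leaf somewhere in t
    yield canonical(t + ((),))
    for i, c in enumerate(t):
        for e in extensions(c):
            yield canonical(t[:i] + (e,) + t[i + 1:])

def closed_subtrees(t, D, min_sup, T):
    # mirrors the CLOSED_SUBTREES pseudocode above
    if t != canonical(t):                  # canonical-representative check
        return T
    sup_t = support(t, D)
    t_is_closed = True
    for ext in set(extensions(t)):
        sup_ext = support(ext, D)
        if sup_ext >= min_sup:
            T = closed_subtrees(ext, D, min_sup, T)
        if sup_ext == sup_t:
            t_is_closed = False            # a super-tree has equal support
    if t_is_closed:
        T.add(t)
    return T
```

On a "tale of two trees"-style dataset, with A a three-node path (((),),) and B a root with two leaves ((), ()), and min_sup = 2, the only closed frequent subtree found is the two-node tree ((),).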
18. Experimental results
TreeNat: unlabeled trees, top-down subtrees, no occurrences.
CMTreeMiner: labeled trees, induced subtrees, occurrences.
19. Closure Operator on Trees
D: the finite input dataset of trees
T: the (infinite) set of all trees
Definition
We define the following Galois connection pair:
For finite A ⊆ D, σ(A) is the set of trees of T that are subtrees of all the trees of A:
σ(A) = {t ∈ T | ∀ t' ∈ A : t ⪯ t'}
For finite B ⊂ T, τD(B) is the set of trees of D that are supertrees of all the trees of B:
τD(B) = {t ∈ D | ∀ t' ∈ B : t' ⪯ t}
Closure Operator
The composition ΓD = σ ∘ τD is a closure operator.
21. Galois Lattice of closed set of trees
(Figure, shown over three slides: the lattice of transaction sets 1, 2, 3, 12, 13, 23, 123; the example trees appear only graphically.)
B = { }
τD(B) = { , }
ΓD(B) = σ ∘ τD(B) = { and its subtrees }
24. Algorithms
Incremental: INCTREENAT
Sliding Window: WINTREENAT
Adaptive: ADATREENAT, which uses ADWIN to monitor change
ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.
ADWIN has rigorous guarantees (theorems):
on the ratio of false positives and false negatives
on the relation between the size of the current window and the rate of change
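ADWIN's core idea can be sketched naively: keep a window of recent values and drop the oldest elements whenever some split of the window into two sub-windows shows a statistically significant difference of means. The class below (SimpleADWIN is our name) is a simplified O(W)-memory illustration of that idea; the real ADWIN of Bifet and Gavaldà uses exponential histograms to achieve logarithmic memory and update time.

```python
import math

class SimpleADWIN:
    """Naive sketch of an ADWIN-style adaptive window.

    Keeps all values explicitly; the real algorithm compresses the
    window into O(log W) buckets."""

    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = []

    def add(self, x):
        self.window.append(x)
        # shrink from the left while some split shows a significant change
        while self._change_detected():
            self.window.pop(0)

    def _change_detected(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        left_sum = 0.0
        for i in range(1, n):
            left_sum += self.window[i - 1]
            n0, n1 = i, n - i
            mean0 = left_sum / n0
            mean1 = (total - left_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)      # harmonic mean of sizes
            eps = math.sqrt((1.0 / (2 * m)) * math.log(4.0 * n / self.delta))
            if abs(mean0 - mean1) > eps:
                return True
        return False
```

Feeding a stream whose mean jumps from 0 to 1 makes the window discard the stale prefix, so the window mean tracks the new distribution.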
25. Experimental Validation: TN1
Figure: experiments on ordered trees with the TN1 dataset; time (sec.) vs. dataset size (millions), comparing CMTreeMiner and INCTREENAT.
26. What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
It is closely related to WEKA.
It includes a collection of offline and online learning methods, as well as tools for evaluation:
boosting and bagging
Hoeffding Trees, with and without Naïve Bayes classifiers at the leaves
28. MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.
31. Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
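These four constraints are exactly what a prequential (test-then-train) evaluation loop respects: every example is seen once, used first to test and then to learn, with a learner that keeps bounded state. A minimal sketch, where MajorityClass and prequential are illustrative names rather than MOA code:

```python
from collections import Counter

class MajorityClass:
    """Toy online learner: predicts the most frequent class seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

def prequential(stream, learner):
    """Test-then-train: each example is tested on, then learned from, once."""
    correct = total = 0
    for x, y in stream:                  # one pass, one example at a time
        if learner.predict(x) == y:      # ready to predict at any point
            correct += 1
        total += 1
        learner.learn(x, y)              # bounded memory for this toy learner
    return correct / total
```

On a stream of ten identical examples the first prediction is necessarily a miss, giving an accuracy of 0.9.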
32. Environments and Data Sources
Environments
Sensor Network: 100 KB
Handheld Computer: 32 MB
Server: 400 MB
Data Sources
Random Tree Generator
Random RBF Generator
LED Generator
Waveform Generator
Function Generator
33. Algorithms
Classifiers: Naive Bayes, Decision stumps, Hoeffding Tree, Hoeffding Option Tree, Bagging and Boosting
Prediction strategies: Majority class, Naive Bayes Leaves, Adaptive Hybrid
34. Hoeffding Option Tree
Hoeffding Option Trees
Regular Hoeffding tree containing additional option nodes that
allow several tests to be applied, leading to multiple Hoeffding
trees as separate paths.
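Hoeffding trees (and the option-tree variant above) take their name from the Hoeffding bound: after $n$ independent observations of a random variable with range $R$, the true mean deviates from the sample mean by more than

```latex
\varepsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
```

with probability at most $\delta$. The tree uses this to decide, with confidence $1-\delta$, that the attribute that looks best on the examples seen so far really is the best split attribute, so each example can be inspected once and discarded.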
37. Ensemble Methods
http://www.cs.waikato.ac.nz/~abifet/MOA/
New ensemble methods:
ADWIN bagging: when a change is detected, the worst classifier is removed and a new classifier is added
Adaptive-Size Hoeffding Tree bagging
38. XML Tree Framework on evolving data streams

                     Maximal              Closed
Dataset   # Trees  Att.  Acc.  Mem.   Att.  Acc.  Mem.
CSLOG12    15483    84  79.64  1.2    228  78.12  2.54
CSLOG23    15037    88  79.81  1.21   243  78.77  2.75
CSLOG31    15702    86  79.94  1.25   243  77.60  2.73
CSLOG123   23111    84  80.02  1.7    228  78.91  4.18

Table: BAGGING on unordered trees.
39. Conclusions
An XML tree stream classifier system.
Using Galois lattice theory, we present methods for mining closed trees:
Incremental
Sliding Window
Adaptive: using ADWIN to monitor change
We use MOA data stream classifiers.