Essential Benefits of Looker Learned from Running It in Production: Data Engineering Study #8 (Masatoshi Abe)
A Looker case-study presentation given at Data Engineering Study #8.
https://forkwell.connpass.com/event/209803/
We believe the operating cost of a BI tool is determined by the combination of two factors: the complexity of the data architecture and the diversity of the outputs. This talk presents Yappli, Inc.'s experience with Looker as a BI tool that makes both factors manageable. Looker is a wonderful tool that, in our view, lets you start a "New Game Plus".
This document provides an overview of Raspberry Pi I/O control and sensor reading. It discusses analog and digital signal processing and common sensor interfaces like I2C, SPI and UART. It also covers programming the GPIO, reading analog sensors using the MCP3008 ADC over SPI, and interfacing with digital sensors like the LIS3DH accelerometer over I2C. Code examples are provided to read sensors and display the data.
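The MCP3008 exchange described above can be sketched in Python. The wiring and spidev parameters in the comment are assumptions; the request/response bit layout follows the MCP3008's 3-byte SPI protocol:

```python
def mcp3008_request(channel):
    """Build the 3-byte request: start bit, then single-ended mode + channel."""
    if not 0 <= channel <= 7:
        raise ValueError("MCP3008 has channels 0-7")
    return [0x01, (0x08 | channel) << 4, 0x00]

def mcp3008_value(response):
    """Decode the 3-byte SPI response into a 10-bit reading (0..1023).

    The result arrives in the low 2 bits of the second byte and all
    8 bits of the third byte.
    """
    return ((response[1] & 0x03) << 8) | response[2]

# On a real Pi this would go over SPI (assumed wiring: bus 0, device 0):
#   import spidev
#   spi = spidev.SpiDev(); spi.open(0, 0); spi.max_speed_hz = 1_350_000
#   raw = mcp3008_value(spi.xfer2(mcp3008_request(0)))
#   volts = raw * 3.3 / 1023

# Decoding a simulated all-ones response gives the full-scale reading:
print(mcp3008_value([0x00, 0x03, 0xFF]))  # 1023
```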
A development team that had been building software with waterfall adopted Scrum, and it was decided that all future development would be agile. When a team develops in an agile style, quality depends on the developers' own testing skills, and testing specialists are rarely involved. At our company, too, the mission given to QA was to carry out quality assurance activities suited to agile development without letting quality slip. We restructured the quality assurance process built up through waterfall development into a Scrum-style process, and by embedding a QA specialist called a SET (Software Engineer in Test) in the Scrum team, we defined the conditions for ensuring quality within Scrum, built a test process, and changed the whole Scrum team's attitude toward quality.
Efficient Online Evaluation of Big Data Stream Classifiers (Albert Bifet)
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
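The single-model prequential setting the abstract refers to can be sketched in a few lines; the `MajorityClass` learner below is a toy stand-in for illustration, not one of the paper's methods:

```python
def prequential_accuracy(stream, model):
    """Prequential (test-then-train) evaluation: each instance is first
    used to test the model, then to train it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test first...
            correct += 1
        model.learn(x, y)           # ...then train
        total += 1
    return correct / total

class MajorityClass:
    """Toy streaming learner: always predicts the class seen most often."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [(i, "a") for i in range(9)] + [(9, "b")]
print(prequential_accuracy(stream, MajorityClass()))  # 0.8
```

Because only one model is built over the single pass, there is no second sample of results from which to compute statistical significance, which is the limitation the paper addresses.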
MOA is a framework for online learning from data streams. It is closely related to WEKA and includes learning algorithms such as boosting, bagging, and Hoeffding Trees, together with tools for evaluation. MOA deals with evolving data streams and is easy to use and extend.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
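Random sampling, the first synopsis technique listed, is commonly implemented as reservoir sampling, which keeps a uniform sample of an unbounded stream in fixed memory. A minimal sketch of Vitter's Algorithm R:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform random sample of size k over an unbounded
    stream using O(k) memory (Vitter's Algorithm R)."""
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(n + 1)   # item replaces a slot w.p. k/(n+1)
            if j < k:
                sample[j] = item
    return sample

random.seed(0)
s = reservoir_sample(range(10_000), 5)
print(len(s))  # 5
```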
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In this lecture we give an overview of mining data streams.
Pitfalls in benchmarking data stream classification and how to avoid them (Albert Bifet)
This document discusses pitfalls in benchmarking data stream classification and proposes ways to avoid them. It analyzes the electricity market dataset, a popular benchmark, and finds that it exhibits temporal dependence that favors classifiers that simply predict the previous value. It introduces new evaluation metrics like kappa plus that account for temporal dependence by comparing to a "no change" classifier. It also proposes a temporally aware classifier called SWT that incorporates previous labels into its predictions. Experiments on electricity and forest cover datasets show SWT and the new metrics better capture classifier performance on temporally dependent streaming data.
Adaptive Learning and Mining for Data Streams and Frequent Patterns (Albert Bifet)
This document summarizes Albert Bifet's 2009 PhD dissertation on adaptive learning and mining for data streams and frequent patterns. It introduces the challenges of mining massive, evolving structured data streams in real-time. It describes the ADWIN algorithm for detecting concept drift in data streams and outlines methods for mining evolving tree data streams, including incremental, sliding window, adaptive and logarithmic relaxed support approaches.
Implementation of Adaptive STFT Algorithm for LFM Signals (eSAT Journals)
Abstract
Time-frequency analysis is normally done by sliding a window through the time-domain data and computing the Fourier transform of the data within the window. The choice of window length determines whether specular or resonant information is emphasized: a narrow window isolates specular reflections but is not wide enough to accommodate the slowly varying global resonances, while a wide window cannot temporally separate resonance and specular information. We therefore adapt the window length to changes in frequency, and demonstrate the approach on a Linear Frequency Modulation (LFM) signal.
Index Terms—LFM, FFT, DFT, STFT and ASTFT.
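The fixed-window STFT the abstract starts from can be sketched with NumPy on a synthetic LFM chirp. The sampling rate and window sizes below are illustrative choices, not values from the paper:

```python
import numpy as np

def stft(x, win_len, hop):
    """Fixed-window STFT: slide a Hann window over x and take the FFT
    of each frame. The adaptive variant would vary win_len per frame."""
    window = np.hanning(win_len)
    frames = [x[i:i + win_len] * window
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

# An LFM (chirp) test signal whose frequency sweeps linearly upward:
fs = 1000
t = np.arange(0, 1, 1 / fs)
chirp = np.sin(2 * np.pi * (50 * t) * t)  # instantaneous freq = 100*t Hz
spec = stft(chirp, win_len=128, hop=64)
print(spec.shape)  # (14, 65): 14 frames, 65 frequency bins
```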
This document discusses data stream management and streaming data warehouses. It defines key concepts like data streams, data stream management systems (DSMS), and streaming data warehouses (SDW). A DSMS processes continuous queries over data streams in real-time with low latency. An SDW integrates recent streaming data with historical data for analysis, using asynchronous and lightweight ETL processes. The document outlines components of a DSMS and SDW and algorithms for query processing, optimization, and load shedding in these systems.
Learnersourcing: Improving Learning with Collective Learner Activity (Juho Kim)
Slides from my thesis defense: "Learnersourcing: Improving Learning with Collective Learner Activity"
Millions of learners today are watching videos on online platforms, such as Khan Academy, YouTube, Coursera, and edX, to take courses and master new skills. But existing video interfaces are not designed to support learning, with limited interactivity and lack of information about learners' engagement and content. Making these improvements requires deep semantic information about video that even state-of-the-art AI techniques cannot fully extract. I take a data-driven approach to address this challenge, using large-scale learning interaction data to dynamically improve video content and interfaces. Specifically, this thesis introduces learnersourcing, a form of crowdsourcing in which learners collectively contribute novel content for future learners while engaging in a meaningful learning experience themselves. I present learnersourcing applications designed for massive open online course videos and how-to tutorial videos, where learners' collective activities 1) highlight points of confusion or importance in a video, 2) extract a solution structure from a tutorial, and 3) improve the navigation experience for future learners. This thesis demonstrates how learnersourcing can enable more interactive, collaborative, and data-driven learning.
Ph.D. Research Update: Year#4 Annual Progress and Planned Activities (Lighton Phiri)
The document discusses research on streamlining the orchestration of learning activities through technology. It presents 5 studies: 1) a situational analysis of contemporary orchestration challenges, 2) a comparative analysis finding organized approaches faster and more successful, 3) orchestrating a flipped class feasibility study, 4) peer tutoring orchestration workload analysis, and 5) reusable online education resource orchestration packages. The goal is to make educators more effective by explicitly organizing enactment activities using an orchestration workbench. Results suggest the organized approach positively impacts the learning experience and workload is acceptable when scripting orchestration. Future work includes experiments and developing thesis manuscripts.
The Technical Debt Trap - AgileIndy 2013 (Doc Norton)
The document discusses the concept of technical debt and how it differs from cruft. It explains that technical debt refers to strategic design decisions made to increase speed that require future work to pay back, while cruft is accidental messy or poorly-written code. The author warns that allowing cruft can lead teams into a trap where velocity is prioritized over quality, resulting in increasing cruft over time. The document provides suggestions for avoiding this trap, such as setting quality thresholds, making incremental fixes, and constantly cleaning code. It also recommends metrics for monitoring cruft and debt levels like code coverage, complexity, and coupling.
Brief report about the contents of the Stream Reasoning workshop at SIWC 2016. Additional info about the event is available at: http://streamreasoning.org/events/sr2016
1) The 1R algorithm generates a one-level decision tree by considering each attribute individually and creating branches for each attribute value. It assigns the majority class to each branch and chooses the attribute with the minimum error.
2) Naive Bayes classification assumes attributes are independent and calculates the probability of each class using Bayes' theorem. It handles missing and numeric attributes.
3) Decision tree algorithms like ID3 use a divide-and-conquer approach, selecting the attribute that maximizes information gain at each node to create branches. Gain ratio addresses issues with highly branched attributes.
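The information gain criterion from point 3 can be computed directly. A small sketch with invented weather-style data, where attribute 0 perfectly predicts the class and attribute 1 is noise:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain of splitting on attribute index attr, as used by ID3:
    parent entropy minus the weighted entropy of the children."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

rows   = [("sunny", "hi"), ("sunny", "lo"), ("rain", "hi"), ("rain", "lo")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # 1.0 (perfect split)
print(information_gain(rows, labels, 1))  # 0.0 (uninformative split)
```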
Spark is providing a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production.
This document summarizes key concepts in nonparametric econometrics and kernel density estimation. It discusses bandwidth selection methods like cross-validation and plug-in approaches. It also covers multivariate density estimation, noting the trade-off between bias and variance. The document analyzes a real example from DiNardo and Tobias on estimating the density of female wages.
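As a concrete example of a plug-in bandwidth, Silverman's rule of thumb combined with a Gaussian kernel can be sketched as follows. This is a generic illustration, not code from DiNardo and Tobias; the data values are invented:

```python
import math

def silverman_bandwidth(xs):
    """Plug-in rule of thumb: h = 1.06 * sigma * n^(-1/5)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)

def kde(xs, x, h):
    """Gaussian-kernel density estimate at point x. A small h lowers
    bias but raises variance; a large h does the opposite."""
    norm = math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) / norm
               for xi in xs) / (len(xs) * h)

# Two clusters, around 5 and around 10:
data = [4.1, 4.9, 5.0, 5.2, 5.8, 9.7, 10.1, 10.4]
h = silverman_bandwidth(data)
# Density is higher near the cluster at 5 than in the gap near 7.5:
print(kde(data, 5.0, h) > kde(data, 7.5, h))  # True
```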
Network Based Kernel Density Estimation for Cycling Facilities Optimal Locati... (Beniamino Murgante)
Network Based Kernel Density Estimation for Cycling Facilities Optimal Location Applied to Ljubljana
Nicolas Lachance-Bernard, Timothée Produit - Ecole polytechnique fédérale de Lausanne
Biba Tominc, Matej Niksic, Barbara Golicnik Marusic - Urban Planning Institute of the Republic of Slovenia
The document discusses how emotions, behaviors, and ideas can spread through social networks like epidemics. It notes that behaviors like obesity, smoking, and depression have been shown to spread interpersonally through social ties. In contrast, positive attributes like happiness, kindness, and love may also spread in a contagious way. The document argues we should start an "epidemic" of spreading love through social networks since love has been shown to be a contagion. It concludes by providing references that explore how social networks and emotional contagion influence people.
Analytics of analytics pipelines: from optimising re-execution to general Dat... (Paolo Missier)
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
Detecting Hacks: Anomaly Detection on Networking Data (James Sirota)
See https://medium.com/@jamessirota for a series of blog entries that goes with this deck...
Defense in Depth for Big Data
Network Anomaly Detection Overview
Volume Anomaly Detection
Feature Anomaly Detection
Model Architecture
Deployment on OpenSOC Platform
Questions
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Fast Perceptron Decision Tree Learning from Evolving Data Streams (Albert Bifet)
The document proposes using perceptron learners at the leaves of Hoeffding decision trees to improve performance on data streams. It introduces a new evaluation metric called RAM-Hours that considers both time and memory usage. The authors empirically evaluate different classifier models, including Hoeffding trees with perceptron and naive Bayes learners at leaves, on several datasets. Results show that hybrid models like Hoeffding naive Bayes perceptron trees often provide the best balance of accuracy, time and memory usage.
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc... (Gilles Fedak)
Active Data : Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures
The Big Data challenge consists of managing, storing, analyzing and visualizing these huge and ever-growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion.
A key point is to handle the complexity of the 'Data Life Cycle', i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span over a large variety of devices and e-infrastructures which implies that many systems are involved in data management and processing.
''Active Data'' is a new approach to automate and improve the expressiveness of data management applications. It consists of
* a 'formal model' for the data life cycle, based on Petri nets, that describes and exposes the data life cycle across heterogeneous systems and infrastructures.
* a 'programming model' that allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happens to any data.
Detecting Hacks: Anomaly Detection on Networking Data (DataWorks Summit)
This document summarizes techniques for anomaly detection on network data. It discusses defense-in-depth strategies using both misuse detection and anomaly detection. It then describes volume-based and feature-based network anomaly detection, including statistical process control techniques. The document outlines a three-phase anomaly detection process and discusses implementation in Hadoop using time series databases. It provides examples of common network anomalies and techniques for batch and online anomaly detection modeling.
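The statistical process control idea behind volume-based detection can be sketched as a Shewhart-style 3-sigma rule over a sliding baseline. The traffic numbers below are invented for illustration:

```python
def control_chart_anomalies(series, window=20, k=3.0):
    """Flag points more than k standard deviations from the mean of the
    preceding window (a simple control chart for volume anomalies)."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mean = sum(hist) / window
        std = (sum((x - mean) ** 2 for x in hist) / window) ** 0.5
        if std > 0 and abs(series[i] - mean) > k * std:
            anomalies.append(i)
    return anomalies

# Steady traffic around 100 packets/s, with one burst at index 30:
traffic = [100, 101, 99, 100, 102] * 6 + [500]
print(control_chart_anomalies(traffic))  # [30]
```

In an online setting the window statistics would be updated incrementally rather than recomputed per point, but the detection rule is the same.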
Introduction to Data streaming - 05/12/2014 (Raja Chiky)
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
This document compares the performance of classification algorithms like decision tree, naive Bayes, and k-nearest neighbors using five data mining tools: RapidMiner, WEKA, Tanagra, Orange, and Knime. The algorithms are tested on an Indian liver patient dataset containing 416 liver patient and 167 non-liver patient records. Accuracy scores are reported from the confusion matrices generated by each tool. Overall, Knime achieved the highest accuracy for all three algorithms, with decision tree and k-nearest neighbor performing better than naive Bayes. WEKA had the lowest naive Bayes accuracy.
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo... (WSO2)
In this webinar, Srinath Perera, director of research at WSO2, will discuss
Big data landscape: concepts, use cases, and technologies
Real-time analytics with WSO2 CEP
Batch analytics with WSO2 BAM
Combining batch and real-time analytics
Introducing WSO2 Machine Learner
The document discusses WSO2's big data platform and applications. It provides an overview of what can be done with big data, including optimizing processes to save money and resources, saving lives through advanced analytics of data like weather and health, and enabling new technologies through simulations. It then describes WSO2's big data architecture and analytics offerings, including the WSO2 Analytics Platform and its real-time and batch processing capabilities. Several use cases are presented, such as for smart homes, healthcare, cloud services, and more. The platform provides high-level query languages and supports multiple industries and custom extensions.
Efficient Re-computation of Big Data Analytics Processes in the Presence of C... (Paolo Missier)
This document discusses an efficient framework called ReComp for re-computing big data analytics processes when inputs or algorithms change. ReComp uses fine-grained process provenance and execution history to estimate the impact of changes and selectively re-execute only affected parts. This can provide significant time savings over fully re-running processes from scratch. The framework was tested on two case studies: genomic variant analysis (SVI tool) and simulation modeling, demonstrating savings of 28-37% compared to complete re-execution. ReComp provides a generic approach but allows customization for specific processes and change types.
The document discusses several US grid projects including campus and regional grids like Purdue and UCLA that provide tens of thousands of CPUs and petabytes of storage. It describes national grids like TeraGrid and Open Science Grid that provide over a petaflop of computing power through resource sharing agreements. It outlines specific communities and projects using these grids for sciences like high energy physics, astronomy, biosciences, and earthquake modeling through the Southern California Earthquake Center. Software providers and toolkits that enable these grids are also mentioned like Globus, Virtual Data Toolkit, and services like Introduce.
1) Scientists at the Advanced Photon Source use the Argonne Leadership Computing Facility for data reconstruction and analysis from experimental facilities in real-time or near real-time. This provides feedback during experiments.
2) Using the Swift parallel scripting language and ALCF supercomputers like Mira, scientists can process terabytes of data from experiments in minutes rather than hours or days. This enables errors to be detected and addressed during experiments.
3) Key applications discussed include near-field high-energy X-ray diffraction microscopy, X-ray nano/microtomography, and determining crystal structures from diffuse scattering images through simulation and optimization. The workflows developed provide significant time savings and improved experimental outcomes.
The document discusses EDF's use of a data lake and data lab to optimize operations and safety at their nuclear power plants. It describes how EDF is building a Hadoop-based data lake called ESPADON to store sensor and operational data from their 59 nuclear plants. They are developing data science algorithms to analyze the data from the whole fleet to improve maintenance and operations. EDF is also creating a data lab team and architecture to develop analytics and quantify the value of these initiatives.
This document discusses the challenges of collecting, storing, and analyzing large volumes of internet measurement data. It examines issues such as distributed and resilient data collection, handling multi-timescale and heterogeneous data from various sources, and developing standardized tools and formats. The paper proposes the "datapository" - an internet data repository designed to address these challenges through a collaborative framework for data sharing, storage, and analysis tools. The goal is to help both network operators and researchers more effectively harness the wealth of data available.
How to expand the Galaxy from genes to Earth in six simple steps (and live sm... (Raffaele Montella)
FACE-IT is an effort to develop a new IT infrastructure to accelerate existing disciplinary research and enable information transfer among traditionally separate fields. At present, finding data and processing it into usable form can dominate research efforts. By providing ready access to not only data but also the software tools used to process it for specific uses (e.g., climate impact and economic model inputs), FACE-IT allows researchers to concentrate their efforts on analysis. Lowering barriers to data access allows researchers to stretch in new directions and to learn and respond to the needs of other fields. FACE-IT builds on the Globus Galaxies platform, which has been developed over the past several years at the University of Chicago. FACE-IT also benefits from substantial software development undertaken by the communities who have developed most of the domain-specific tools required to populate FACE-IT with useful capabilities. The FACE-IT Galaxy manages earth-system datatypes (such as NetCDF), new tool parameters (dates, map, opendap), aggregated datatypes (RAFT), service providers and map visualizers.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
Hw09 Fingerpointing Sourcing Performance Issues (Cloudera, Inc.)
The document discusses automated problem diagnosis for large-scale distributed systems like Hadoop. It presents techniques for analyzing system logs and metrics to detect faults and localize their root causes. The goal is to automate diagnosis and provide early detection of problems to improve administrators' ability to manage complex systems.
Similar to Moa: Real Time Analytics for Data Streams
Artificial intelligence and data stream mining (Albert Bifet)
Big Data and Artificial Intelligence have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from data streams has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of artificial intelligence research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, industrial applications, open source tools, and current challenges of data stream mining.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
Mining Big Data Streams with APACHE SAMOA (Albert Bifet)
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real-time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also on several other distributed stream processing engines such as Storm and Samza.
Apache Samoa: Mining Big Data Streams with Apache Flink (Albert Bifet)
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries, and frameworks.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection, and recommendations.
3) A key challenge addressed by SAMOA is how to perform distributed stream mining on high-volume, high-velocity data streams at low latency using approaches like Apache Flink that can scale to handle large, fast data.
Data science involves extracting insights from large volumes of data. It is an interdisciplinary field that uses techniques from statistics, machine learning, and other domains. The document provides examples of classification algorithms like k-nearest neighbors, naive Bayes, and perceptrons that are commonly used in data science to build models for tasks like spam filtering or sentiment analysis. It also discusses clustering, frequent pattern mining, and other machine learning concepts.
This document provides an introduction to big data and MapReduce frameworks. It discusses:
- What big data is and examples of large datasets.
- An overview of MapReduce, including how it allows programmers to break problems into parallelizable map and reduce tasks.
- Details of how MapReduce frameworks like Apache Hadoop work, including distributed processing, fault tolerance, and the roles of mappers, reducers, and other components.
The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
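The Hoeffding bound behind such trees is simple to state: with probability 1 - delta, the true mean of n observations of a variable with range R lies within epsilon of the empirical mean. A sketch of the split decision, with hypothetical gain estimates:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean of n observations of a variable
    with range R is within epsilon of the observed mean w.p. 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

# A Hoeffding tree splits on the best attribute once the observed gain
# gap between best and second-best exceeds epsilon. Gains here are
# invented for illustration; information gain on a binary class has R = 1.
best, second = 0.30, 0.18
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
print(best - second > eps)  # True: the gap exceeds epsilon, so we split
```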
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
Multi-label Classification with Meta-labels (Albert Bifet)
The area of multi-label classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach suffers from class imbalance and a restricted hypothesis space that negatively affects its predictive performance, and can easily be outperformed by methods which learn labels together. A number of methods have grown around the label powerset approach, which models label combinations together as class values in a multi-class problem. We describe the label-powerset-based solutions under a general framework of "meta-labels". We provide theoretical justification for this framework which has been lacking, by viewing meta-labels as a hidden layer in an artificial neural network. We explain how meta-labels essentially allow a random projection into a space where non-linearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multi-label problems. Indeed, we present a novel model in the framework and evaluate it empirically against several high-performing methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. Our deployment of an ensemble of meta-label classifiers obtains competitive accuracy for a fraction of the computation required by the current meta-label methods for multi-label classification.
STRIP: stream learning of influence probabilities.Albert Bifet
This document presents a method called STRIP (Streaming Learning of Influence Probabilities) for learning influence probabilities between users in a social network from a streaming log of propagations. It describes three solutions: (1) storing the whole social graph in memory, (2) using min-wise independent hashing to estimate probabilities while using sublinear space, and (3) estimating probabilities only for the most active users to be more space efficient. Experimental results on a Twitter dataset showed these solutions provided good approximations while using reasonable memory and processing time.
Efficient Data Stream Classification via Probabilistic Adaptive WindowsAlbert Bifet
This document discusses efficient data stream classification using probabilistic adaptive windows. It introduces the concept of data streams which have potentially infinite sequences of high-speed data that must be processed in real-time with limited memory. It then describes the probabilistic approximate window (PAW) algorithm, which maintains a sample of data instances in logarithmic memory by giving greater weight to newer instances. The document evaluates several data stream classification methods on real and synthetic data streams and finds that k-nearest neighbors with PAW has higher accuracy and lower memory usage than other methods.
Big Data is a new term used to identify datasets that we can not manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity, of such data.
Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. We are creating the same quantity of data every two days, as we created from the dawn of time up until 2003. Evolving data streams methods are becoming a low-cost, green methodology for real time online prediction and analysis. We discuss the current and future trends of mining evolving data streams, and the challenges that the field will have to overcome during the next years.
Mining Frequent Closed Graphs on Evolving Data StreamsAlbert Bifet
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
The document outlines a tutorial on handling concept drift in machine learning. It discusses the challenges of concept drift when applying supervised learning algorithms to streaming data where the underlying data distribution changes over time. The tutorial aims to provide an integrated view of adaptive learning methods and how they can handle concept drift. It covers topics such as the problem of concept drift, techniques for handling drift, evaluating adaptive learning approaches, and applications that experience concept drift.
Leveraging Bagging for Evolving Data StreamsAlbert Bifet
The document presents new methods for leveraging bagging for evolving data streams. It discusses using randomization techniques like Poisson distributions for input data and random output codes to increase diversity among classifiers. Experimental results on data streams with concept drift show the proposed methods like Leveraging Bagging and Leveraging Bagging MC improve accuracy over baselines like Hoeffding Trees and Online Bagging, while methods like Leveraging Bagging ME reduce RAM-Hours usage. The paper aims to improve accuracy and resource usage for data stream mining under concept drift.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.Albert Bifet
This document discusses methods for mining data streams, which are potentially infinite sequences of data that change over time. It describes using the ADWIN algorithm, which is an adaptive sliding window technique without parameters, to extract information from data streams using few resources. It also covers mining massive data, where the amount of digital information created now exceeds available storage. Algorithmic efficiency is important for green computing approaches to efficiently using computing resources. The document provides an example of finding a missing number in an increasing sequence and using random sampling to find a number in the upper half of a sorted list using sublinear space and time.
Adaptive XML Tree Mining on Evolving Data StreamsAlbert Bifet
This document discusses mining frequent closed trees from XML data streams. It presents three algorithms for mining closed trees incrementally, with sliding windows, and adaptively using ADWIN to monitor change. Experimental results on real datasets show the adaptive approach using ADWIN achieves good accuracy while using limited memory.
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data StreamsAlbert Bifet
This document discusses mining frequent closed unlabeled rooted trees in data streams. It introduces the problem of finding frequent closed trees in a data stream of unlabeled rooted trees. It describes some of the challenges of data streams, including that the sequence is potentially infinite, there is a high amount of data requiring sublinear space, and a high speed of arrival requiring sublinear time per example. The document outlines an approach using ADWIN, an adaptive sliding window algorithm, to detect concept drift and adapt the window size accordingly.
Moa: Real Time Analytics for Data Streams
1. MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering
Albert Bifet, Geoff Holmes, Bernhard Pfahringer,
Philipp Kranen, Hardy Kremer, Timm Jansen and Thomas Seidl
University of Waikato, Hamilton, New Zealand
Data Management and Data Exploration Group, RWTH Aachen University, Germany
Cumberland Lodge, 2 September 2010
Workshop on Applications of Pattern Analysis 2010
2. Mining Massive Data
2007
Digital Universe: 281 exabytes (billion gigabytes)
The amount of information created exceeded available storage for the first time
Eric Schmidt, August 2010
"Every two days now we create as much information as we did from the dawn of civilization up until 2003."
5 exabytes of data
Twitter
106 million registered users
3 billion requests a day via its API
3. Efficient Algorithms
Evolving Data Streams
Extract information from a potentially infinite sequence of data, possibly varying over time, using few resources.
Stream Mining Algorithms
Fast methods that work without storing the whole dataset in memory.
Traditional methods do not deal with these restrictions.
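One classic illustration of "using few resources" is reservoir sampling, which keeps a uniform sample of size k from a potentially infinite stream in O(k) memory and O(1) time per element. This is an illustrative textbook technique, not a specific MOA algorithm; the class and method names below are hypothetical:

```java
import java.util.Arrays;
import java.util.Random;

// Reservoir sampling: a uniform sample of k elements from a stream,
// seen one element at a time, using memory independent of stream length.
public class ReservoirSample {
    public static int[] sample(int[] stream, int k, Random rng) {
        int[] reservoir = new int[k];
        for (int i = 0; i < stream.length; i++) {
            if (i < k) {
                reservoir[i] = stream[i];        // fill the reservoir first
            } else {
                int j = rng.nextInt(i + 1);      // element i survives with prob k/(i+1)
                if (j < k) reservoir[j] = stream[i];
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        int[] stream = new int[1000];
        for (int i = 0; i < stream.length; i++) stream[i] = i;
        // 5 uniformly chosen elements, whatever the stream length
        System.out.println(Arrays.toString(sample(stream, 5, new Random(42))));
    }
}
```

The point is the contrast with traditional batch methods: the sample stays valid at every moment of the stream, without ever holding the full dataset.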
4. What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
It is closely related to WEKA.
It includes a collection of offline and online algorithms, as well as tools for evaluation:
classification
clustering
Easy to extend
Easy to design and run experiments
5. WEKA
Waikato Environment for Knowledge Analysis
Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java
Released under the GPL
Support for the whole process of experimental data mining:
Preparation of input data
Statistical evaluation of learning schemes
Visualization of input data and the result of learning
Used for education, research and applications
Complements "Data Mining" by Witten & Frank
7. MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.
10. Data stream learning cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
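The four requirements of the learning cycle can be sketched with a toy majority-class learner. This is a hypothetical class, not MOA's actual learner API; MOA's real interfaces work on attribute-valued instances rather than bare labels:

```java
import java.util.HashMap;
import java.util.Map;

// A toy stream learner obeying the four requirements above:
// one pass, bounded memory (one counter per label), constant time
// per example, and ready to predict at any point.
public class MajorityClassLearner {
    private final Map<String, Integer> counts = new HashMap<>();

    // Requirement 1-3: inspect each example once, in O(1) time and space.
    public void trainOnInstance(String label) {
        counts.merge(label, 1, Integer::sum);
    }

    // Requirement 4: a prediction is available at any moment of the stream.
    public String predict() {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MajorityClassLearner learner = new MajorityClassLearner();
        for (String label : new String[]{"spam", "ham", "spam"}) {
            learner.trainOnInstance(label);
        }
        System.out.println(learner.predict()); // prints "spam"
    }
}
```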
12. Classification Experimental setting
Evaluation procedures for Data Streams
Holdout
Interleaved Test-Then-Train or Prequential
Environments
Sensor Network: 100 Kb
Handheld Computer: 32 Mb
Server: 400 Mb
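The Interleaved Test-Then-Train (prequential) procedure uses every instance first to test the current model and then to train it, so accuracy is measured on unseen data without a separate holdout set. A minimal sketch, using a toy majority-class predictor on binary labels (hypothetical names, not MOA's evaluator API):

```java
// Prequential evaluation: for each instance, test first, then train.
public class PrequentialSketch {
    public static double prequentialAccuracy(int[] labels) {
        int[] counts = new int[2]; // counts of labels 0 and 1 seen so far
        int correct = 0;
        for (int label : labels) {
            int prediction = counts[1] > counts[0] ? 1 : 0; // test on current model
            if (prediction == label) correct++;
            counts[label]++;                                // then train on the instance
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] stream = {0, 0, 1, 0, 0, 0};
        // 5 of 6 predictions are correct; prints 0.8333333333333334
        System.out.println(prequentialAccuracy(stream));
    }
}
```

By contrast, the Holdout procedure reserves a fixed test set and measures accuracy on it periodically, as the EvaluatePeriodicHeldOutTest command later in these slides does.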
13. Classification Experimental setting
Data Sources
Random Tree Generator
Random RBF Generator
LED Generator
Waveform Generator
Hyperplane
SEA Generator
STAGGER Generator
14. Classification Experimental setting
Classifiers
Naive Bayes
Decision stumps
Hoeffding Tree
Hoeffding Option Tree
Bagging and Boosting
ADWIN Bagging and Leveraging Bagging
Prediction strategies
Majority class
Naive Bayes Leaves
Adaptive Hybrid
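The Hoeffding Tree in the classifier list splits a leaf only when the observed merit difference between the two best attributes exceeds the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)). A sketch of that bound, where R is the range of the merit measure, δ the allowed error probability, and n the number of examples seen at the leaf (the class name is hypothetical):

```java
// The Hoeffding bound: with probability 1 - delta, the true mean of a
// random variable with range R is within epsilon of its observed mean
// after n independent observations.
public class HoeffdingBound {
    public static double bound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    public static void main(String[] args) {
        // The bound shrinks as more examples arrive, so split decisions
        // made on a stream become statistically safe without storing data.
        System.out.println(bound(1.0, 1e-7, 200));
        System.out.println(bound(1.0, 1e-7, 20000));
    }
}
```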
16. Clustering Experimental setting

Internal measures               External measures
Gamma                           Rand statistic
C Index                         Jaccard coefficient
Point-Biserial                  Folkes and Mallow Index
Log Likelihood                  Hubert Γ statistics
Dunn's Index                    Minkowski score
Tau                             Purity
Tau A                           van Dongen criterion
Tau C                           V-measure
Somer's Gamma                   Completeness
Ratio of Repetition             Homogeneity
Modified Ratio of Repetition    Variation of information
Adjusted Ratio of Clustering    Mutual information
Fagan's Index                   Class-based entropy
Deviation Index                 Cluster-based entropy
Z-Score Index                   Precision
D Index                         Recall
Silhouette coefficient          F-measure

Table: Internal and external clustering evaluation measures.
20. Command Line
EvaluatePeriodicHeldOutTest

java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask
  "EvaluatePeriodicHeldOutTest -l DecisionStump
   -s generators.WaveformGenerator
   -n 100000 -i 100000000 -f 1000000" > dsresult.csv

This command creates a comma-separated values file by:
training the DecisionStump classifier on the WaveformGenerator data,
using the first 100 thousand examples for testing,
training on a total of 100 million examples, and
testing every one million examples.
24. Summary
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
http://www.moa.cs.waikato.ac.nz
It is closely related to WEKA.
It includes a collection of offline and online algorithms, as well as tools for evaluation:
classification
clustering
MOA deals with evolving data streams
MOA is easy to use and extend