Genta Yoshimura, Atsunori Kanemura, Hideki Asoh,
"Enumerating Hub Motifs in Time Series Based on the Matrix Profile,"
5th Workshop on Mining and Learning from Time Series (MiLeTS’19).
Anomaly detection using One-Class Neural Networks (OCNN) - DaeJin Kim
This document discusses anomaly detection using one-class neural networks (OC-NN). It begins by describing existing hybrid approaches, which use an autoencoder to extract features and then a one-class SVM to detect outliers. The motivation for OC-NN is that the feature extraction in these hybrid models is generic and not tailored for anomaly detection. OC-NN instead trains the network with a one-class SVM-like loss function, so that the anomaly detection objective directly shapes the learned representation. The model is evaluated on several datasets and is reported to outperform existing state-of-the-art anomaly detection methods.
This document discusses maximum entropy models and their application to natural language processing tasks. It introduces maximum entropy modeling, describing how the approach works by choosing a probability distribution with maximum entropy subject to constraints from observed data. Training involves using algorithms like generalized iterative scaling to find optimal feature weights that satisfy the constraints. Maximum entropy models have been applied successfully to tasks like part-of-speech tagging, machine translation, and language modeling.
This document discusses different network topologies including mesh, star, bus, ring, tree, and hybrid topologies. A network topology refers to the layout of connected devices on a network. Mesh topology connects all devices to each other, star topology connects all devices to a central hub, and bus topology connects all devices to a single cable in a line. Ring and tree topologies connect devices in circular and branching patterns, while hybrid uses elements of different topologies.
The document discusses clustering techniques and provides details about the k-means clustering algorithm. It begins with an introduction to clustering and lists different clustering techniques. It then describes the k-means algorithm in detail, including how it works, the steps involved, and provides an example illustration. Finally, it discusses comments on the k-means algorithm, focusing on aspects like choosing the value of k, initializing cluster centroids, and different distance measurement methods.
Mining sequential patterns for interval based - ijcsa
Sequential pattern mining finds the frequent subsequences or patterns in a given set of sequences. The TPrefixSpan algorithm finds the relevant frequent patterns in sequences formed from interval-based events. In our proposed work, we add multiple constraints (item, length, and aggregate) to the interval-based TPrefixSpan algorithm. Adding these constraints improves the efficiency and effectiveness of the algorithm. The proposed constraint-based algorithm, CTPrefixSpan, has been applied to a synthetic medical dataset. The algorithm can also be applied to stock market analysis, DNA sequence analysis, etc.
KEYWORDS
Sequential patterns, temporal patterns, Constraints, Interval based events.
1.5. Ensemble learning with Apache Spark MLlib 1.5 - leorick lin
This document discusses ensemble learning techniques in Apache Spark MLlib 1.5. It begins by defining ensemble learning as combining multiple learning modules to increase model stability and predictive power. It then provides examples of sources of variation between models, such as different data samples, assumptions, modeling techniques, and initialization parameters. Next, it demonstrates mathematically how combining three 70% accurate binary classifiers through majority voting can achieve over 78% accuracy. The remainder of the document explains techniques like bagging, boosting, and stacking to reduce bias and variance errors by leveraging diverse models. It provides an example using stacking with Spark MLlib on a dataset to improve over a random forest baseline precision.
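The majority-voting figure quoted above can be checked with a few lines of Python. This is a minimal sketch, not code from the slides: the function name `majority_vote_accuracy` is made up for illustration, and the calculation assumes the three classifiers make errors independently of one another, which real ensembles only approximate.

```python
from itertools import product


def majority_vote_accuracy(p: float, n: int = 3) -> float:
    """Probability that a strict majority of n classifiers is correct.

    Assumption: each classifier is correct with probability p,
    independently of the others.
    """
    total = 0.0
    # Enumerate every pattern of correct (True) / wrong (False) votes.
    for outcome in product([True, False], repeat=n):
        if sum(outcome) > n / 2:  # strict majority voted correctly
            prob = 1.0
            for ok in outcome:
                prob *= p if ok else (1.0 - p)
            total += prob
    return total


# Three 70% classifiers: 0.7^3 + 3 * 0.7^2 * 0.3
print(round(majority_vote_accuracy(0.7), 3))  # 0.784
```

If the classifiers' errors are correlated, the gain from voting shrinks, which is why the techniques listed in the summary aim for diverse models.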
Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval - eXascale Infolab
Slides for the paper "Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval" by Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux presented at SIGIR2012
What am I going to get from this course?
Provides a basic conceptual understanding of how clustering works
Provides intuitive understanding of the mathematics behind various clustering algorithms
Walk through Python code examples on how to use various cluster algorithms
Show how clustering is applied in various industry applications
Check it on Experfy: https://www.experfy.com/training/courses/unsupervised-learning-clustering
K-means clustering is an unsupervised learning algorithm that partitions observations into K clusters by minimizing the within-cluster sum of squares. It works by iteratively assigning observations to the closest cluster centroid and recalculating centroids until convergence. K-means requires the number of clusters K as an input and is sensitive to initialization but is widely used for clustering large datasets due to its simplicity and efficiency.
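The assign-then-recompute loop described above can be sketched in plain Python. This is an illustrative sketch rather than the course's own code: the function name `kmeans`, the random initialization from the data points, and the fixed `seed` are all assumptions made here, and points are expected to be tuples of floats.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its closest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Because the result depends on the starting centroids, practical implementations usually rerun with several initializations and keep the solution with the lowest within-cluster sum of squares.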
This document presents a methodology for applying text mining techniques to SQL query logs from the Sloan Digital Sky Survey (SDSS) SkyServer database. The methodology involves parsing, cleaning, and tokenizing SQL queries to represent them as feature vectors that can be analyzed using data mining algorithms. Experimental results demonstrate clustering SQL queries using fuzzy c-means clustering and visualizing relationships between queries using self-organizing maps. The methodology is intended to provide insights into database usage patterns from analysis of the SQL query logs.
This document proposes a project to recognize human activities from observation data using data mining techniques. It will use a dataset containing timestamped coordinate data from tags on people performing different activities. The project will preprocess the data, apply sequential pattern mining using the GSP algorithm to discover patterns of tag movements, evaluate the results using cross-validation, and analyze the learning outcomes of applying data mining to human activity recognition.
1) The document proposes a graph-based method using graph convolutional networks to address challenges in sequential recommendation, such as extracting implicit preferences from long behavior sequences and adapting to changing user preferences over time.
2) It constructs an interest graph from user behaviors and designs an attentive graph convolutional network and dynamic pooling technique to aggregate implicit signals into explicit preferences.
3) Experimental results on two large-scale datasets show the proposed method significantly outperforms state-of-the-art sequential recommendation methods.
SYNOPSIS on Parse representation and Linear SVM. - bhavinecindus
1. The document discusses a thesis on using sparse feature parameterization and multi-kernel SVM for large scale scene classification. The objective is to improve accuracy for large datasets using sparse representations and machine learning algorithms.
2. Key challenges include high dimensionality reducing accuracy for large datasets, nonlinear distributions, and computational costs of deep learning models. The research aims to address these issues.
3. The motivation from literature shows that multi-kernel SVMs have proved effective but could be improved by minimizing redundancy and optimizing kernel parameters for feature sets.
The document discusses chapter 8 of a textbook on data mining concepts and techniques. It covers various topics related to cluster analysis, including what cluster analysis is, different types of data that can be used for cluster analysis, major categories of clustering methods like partitioning, hierarchical, density-based, grid-based, and model-based methods. It also discusses outlier analysis and provides examples of clustering applications.
This document discusses cluster analysis and clustering algorithms. It defines a cluster as a collection of similar data objects that are dissimilar from objects in other clusters. Unsupervised learning is used with no predefined classes. Popular clustering algorithms include k-means, hierarchical, density-based, and model-based approaches. Quality clustering produces high intra-class similarity and low inter-class similarity. Outlier detection finds dissimilar objects to identify anomalies.
The document discusses JARVIS-ML, an AI system for fast and accurate screening of materials properties. It uses machine learning models trained on a large dataset of materials properties calculated using density functional theory. Some key points:
- JARVIS-ML uses gradient boosting decision trees to predict properties like formation energies, bandgaps, and elastic moduli, achieving good accuracy compared to DFT calculations.
- Feature selection is important, and JARVIS-ML uses over 1,500 descriptors of atomic structure. Chemical features are most important for predictions.
- The models can screen thousands of materials in seconds, much faster than DFT. This enables large-scale materials discovery tasks like genetic algorithm searches.
Research Inventy : International Journal of Engineering and Science - inventy
Research Inventy : International Journal of Engineering and Science is published by a group of young academic and industrial researchers, with 12 issues per year. It is an open access journal, available online and in print, that provides rapid (monthly) publication of articles in all areas of the subject, such as civil, mechanical, chemical, electronic, and computer engineering as well as production and information technology. The journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers are published within 20 days of acceptance, and the peer review process takes only 7 days. All articles published in Research Inventy are peer-reviewed.
The document describes a proposed hybrid method for multichannel signal separation using supervised nonnegative matrix factorization (SNMF). The method combines directional clustering for spatial separation with SNMF incorporating spectrogram restoration for spectral separation. Experiments show the hybrid method achieves better separation performance than conventional single-channel SNMF or multichannel NMF methods, as measured by signal-to-distortion ratio. The optimal divergence for the SNMF component involves a tradeoff between separation ability and ability to restore missing spectral components.
Machine Learning Techniques for the Smart Grid – Modeling of Solar Energy usi... - Wilfried Elmenreich
This talk covers the application of machine learning techniques to energy applications, in particular to modeling solar radiation. The first part explores meta-heuristic search algorithms and envisions their application to designing distributed, self-organizing control systems using evolutionary algorithms. The second part gives an introduction to solar radiation modeling and shows how artificial neural networks can learn the correlation of input parameters such as latitude, longitude, temperature, humidity, month, day, and hour in order to predict global and diffuse solar radiation.
Hybrid multichannel signal separation using supervised nonnegative matrix fac... - Daichi Kitamura
Presented at Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2014 (APSIPA 2014, international conference)
Daichi Kitamura, Hiroshi Saruwatari, Satoshi Nakamura, Yu Takahashi, Kazunobu Kondo, Hirokazu Kameoka, "Hybrid multichannel signal separation using supervised nonnegative matrix factorization with spectrogram restoration," Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2014 (APSIPA 2014), Siem Reap, Cambodia, December 2014 (invited paper).
The document discusses using machine learning algorithms to estimate the periods of variable stars from their light curves. It reviews existing period estimation methods and the problem of discriminating the true period from spurious periods. It then explores using a machine learning approach, focusing on extracting features from light curves and using algorithms like Prism to classify samples and determine periods. The preliminary experiments showed Prism had the highest performance but challenges remained around feature selection and noise elimination preprocessing.
Time-delayed collective flow diffusion models for inferring latent people flo... - Shun Kojima
This document summarizes a research paper that proposes a new model called Time-delayed Collective Flow Diffusion Models (T-CFDM) to infer latent people flow from limited aggregated location data. T-CFDM extends previous Collective Flow Diffusion Models (CFDM) by incorporating a probability distribution of travel time between areas and improving robustness to noisy data. An efficient EM algorithm is also proposed to simultaneously optimize parameters like movement probabilities, numbers, time distributions, and noise variance. The accuracy of the approach is validated on real-world datasets of exhibitor foot traffic and New York City bicycle and taxi flows.
Large scale cell tracking using an approximated Sinkhorn algorithm - Parth Nandedkar
Cell tracking at large scale (over 1 million cells) has not yet been achievable within a reasonable time frame with current NN/RNN/Bi-RNN based methods. This individual research, which I conducted at Osaka University, ISIR, seeks to solve this problem using the Sinkhorn algorithm, taking inspiration from the MPM method (Hayashida, 2020).
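For readers unfamiliar with the Sinkhorn algorithm mentioned above, the following is a minimal sketch of the standard, unapproximated iteration for entropy-regularized optimal transport. It is not the approximated variant from the slides, and the function name `sinkhorn`, the regularization strength `eps`, and the toy cost matrix are assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, iters=500):
    """Standard Sinkhorn iteration for entropy-regularized optimal transport.

    cost: n x m matrix of matching costs (nested lists)
    a, b: source and target marginals, each summing to the same total
    Returns an n x m transport plan whose rows sum to roughly a and
    whose columns sum to b.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: low cost -> large kernel value.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two objects in one frame, two in the next; matching 0->0 and 1->1 is cheap.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
print(plan[0][0] > plan[0][1])  # True
```

In a tracking setting, the cost matrix would typically encode how unlikely it is that a detection in one frame corresponds to a detection in the next, and the resulting plan gives soft assignments between them.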
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
More Related Content
Similar to MiLeTS'19: Enumerating Hub Motifs in Time Series Based on the Matrix Profile
Natural Language Processing (NLP), RAG and its applications .pptx - fkyes25
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data - Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Global Situational Awareness of A.I. and where its headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
State of Artificial Intelligence Report 2023 kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024 Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
MiLeTS'19: Enumerating Hub Motifs in Time Series Based on the Matrix Profile
1. Enumerating Hub Motifs in Time Series Based on the Matrix Profile
Genta Yoshimura (1,2), Atsunori Kanemura (1,3), Hideki Asoh (1)
(1) National Institute of Advanced Industrial Science and Technology (AIST)
(2) Mitsubishi Electric Corporation
(3) LeapMind Inc.
5th Workshop on Mining and Learning from Time Series (MiLeTS’19)
Aug 5, 2019 - Anchorage, Alaska, USA
2. Outline
1. Introduction
• Motif Enumeration
• Problems in Existing Methods
2. Method
• Novel Motif Definition: Hub Motif
• Proposed Method: HubFinder
3. Experiments
• Synthetic Data
• Human Motion Data
4. Conclusion
• Summary
G. Yoshimura et al., Enumerating Hub Motifs in Time Series Based on the Matrix Profile, MiLeTS'19
4. Motif Enumeration from Time Series
Motif = a subsequence that occurs frequently in time series
• Finding motifs is useful for many time series mining tasks
• Classification, forecasting, segmentation, anomaly detection, …
Motif Enumeration
• Enumerate multiple motifs in order of significance rather than finding a single motif
• Most time series include multiple patterns
• In our problem setting, motif length W is a tunable parameter
• Not variable-length motifs, but fixed-length motifs
[Figure: a time series plotted over time, with the occurrences of Motif 1 and Motif 2 marked; each occurrence has length W]
5. Existing Two Motif Definitions
The difference between the two definitions arises from how we regard a subsequence as significant
1. Range motif
• A subsequence is significant if there exist many subsequences inside the sphere of radius R
2. Closest-pair motif
• A subsequence is significant if the distance to its closest subsequence is small
Note
• Z-normalized Euclidean Distance (ED) is used as subsequence distance
• Trivial matches are ignored when finding neighbor subsequences
[Figure: subsequences of length W. Range motif example: n1=6 neighbors inside radius R is more significant than n2=3. Closest-pair motif example: d1=0.63 is more significant than d2=0.89]
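The two significance measures above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not the authors' code: the function names are hypothetical, and the exclusion zone for trivial matches (any pair of start positions less than W apart) is an assumption, since the slide does not state it.

```python
import numpy as np

def znorm(x):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    x = np.asarray(x, dtype=float)
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x)

def distance_matrix(series, w):
    """Pairwise z-normalized ED between all length-w subsequences.

    Trivial matches are set to inf. The rule |i - j| < w is an assumed
    exclusion zone. Memory grows as n * n * w, so use short series only.
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - w + 1
    subs = np.array([znorm(series[i:i + w]) for i in range(n)])
    d = np.linalg.norm(subs[:, None, :] - subs[None, :, :], axis=2)
    idx = np.arange(n)
    d[np.abs(idx[:, None] - idx[None, :]) < w] = np.inf
    return d

def range_significance(d, radius):
    """Range motif: number of non-trivial neighbors inside the radius (larger is better)."""
    return (d <= radius).sum(axis=1)

def closest_pair_significance(d):
    """Closest-pair motif: distance to the nearest non-trivial neighbor (smaller is better)."""
    return d.min(axis=1)
```

For example, `distance_matrix(series, 16)` on four repetitions of a 16-sample sine period gives a closest-pair distance of about zero for every subsequence, because an identical copy always lies exactly one period away.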
6. Existing Motif Enumeration Methods
Existing methods [Bagnall+14] require a radius parameter R
• Place spheres of radius R so as not to overlap each other
• Iteratively find the most significant subsequence as a motif and remove the subsequences inside its sphere of radius R
1. SetFinder = Range motif based method
2. ScanMK = Closest-pair motif based method
[Figure: SetFinder repeatedly selects argmax_i n_i and ScanMK repeatedly selects argmin_i d_i; after each selection the subsequences inside the sphere of radius R are removed]
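The iterative select-and-remove loop described on this slide might look roughly as follows. This is a simplified sketch operating on a precomputed distance matrix; the function name and the `mode` switch are hypothetical, and the actual SetFinder and ScanMK procedures in [Bagnall+14] contain details that are not reproduced here.

```python
import numpy as np

def enumerate_by_radius(d, radius, k, mode="range"):
    """Greedy radius-based enumeration on a square distance matrix d.

    mode="range":   pick the item with the most neighbors within the radius.
    mode="closest": pick the item with the smallest nearest-neighbor distance.
    After each pick, every item inside the sphere around it is removed.
    """
    d = np.array(d, dtype=float)
    np.fill_diagonal(d, np.inf)                  # an item is not its own neighbor
    alive = np.ones(len(d), dtype=bool)
    motifs = []
    while len(motifs) < k and alive.any():
        sub = np.where(alive[None, :], d, np.inf)    # ignore removed items
        if mode == "range":
            score = np.where(alive, (sub <= radius).sum(axis=1), -1)
            best = int(np.argmax(score))
        else:
            score = np.where(alive, sub.min(axis=1), np.inf)
            best = int(np.argmin(score))
            if not np.isfinite(score[best]):     # no remaining pair to compare
                break
        motifs.append(best)
        alive &= d[best] > radius                # remove the sphere around the motif
        alive[best] = False
    return motifs
```

Both modes depend on `radius` for the removal step, which is why the choice of R affects every motif found after the first one.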
7. Problems in Existing Methods
Existing methods suffer from the radius parameter R
1. It is not easy to tune R
• Appropriate parameter R changes
in accordance with the target dataset
• We cannot even know which R is appropriate
in most real applications where no ground truth is available
2. There are cases where the existing methods fail to enumerate motifs successfully no matter how finely R is tuned
• Such cases are easy to construct and actually occur in real datasets
A novel motif enumeration method is necessary
[Figure: spheres of radius R that are too small or too large for the data]
9. Novel Motif Definition: Hub Motif
To be free of the radius parameter R
1. Range motif
• A subsequence is significant
if there exist many subsequences
inside the sphere of radius R
2. Closest-pair motif
• A subsequence is significant
if the distance to its closest
subsequence is small
3. Hub motif
• A subsequence is significant
if a sum of distances from
other subsequences is small
• Looks like a wheel hub
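As a rough illustration of the hub motif significance, the sketch below sums each row of a distance matrix. It is written under assumptions: the slide does not spell out which subsequences enter the sum, so this version simply adds up all finite distances to the other items, and the function name is hypothetical.

```python
import numpy as np

def hub_significance(d):
    """Sum of distances from every other item; a smaller sum is more hub-like.

    Non-finite entries (for example excluded trivial matches) are skipped,
    which is an assumption of this sketch.
    """
    d = np.array(d, dtype=float)
    np.fill_diagonal(d, 0.0)                     # the distance to itself adds nothing
    return np.where(np.isfinite(d), d, 0.0).sum(axis=1)

# Toy example: five points on a line stand in for subsequences.
pts = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
d = np.abs(pts[:, None] - pts[None, :])
sums = hub_significance(d)
hub = int(np.argmin(sums))                       # index of the most central point
```

Unlike the range and closest-pair measures, nothing in this computation takes a radius.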
[Figure: the three definitions side by side. Range motif: n1=6, n2=3 inside radius R. Closest-pair motif: d1=0.63, d2=0.89. Hub motif: Σ_k d_ik = 5.12 is more significant than Σ_k d_jk = 7.36]
10. Proposed Method: HubFinder
HubFinder does not require the radius parameter R
• Motif length: W
• Number of motifs: K
HubFinder consists of two steps
1. Extract candidates for motifs using the matrix profile
2. Refine candidates into K motifs according to the hub motif significance
[Figure: HubFinder pipeline. Step 1 extracts candidates of length W from the time series; step 2 refines the candidates into K motifs]
11. Step 1. Extract candidates for motifs
• Time series: $X$
• $i$-th subsequence: the length-$W$ subsequence of $X$ starting at position $i$
• Matrix profile: $P$
• $P_i$ is the z-normalized Euclidean Distance (ED) between the $i$-th subsequence and its closest subsequence, excluding its trivial matches
• Can be computed efficiently using the STOMP algorithm [Zhu+16]
• The $i$-th subsequence is a motif candidate if $P_i$ is a local minimum of $P$
• Use a fixed-length sliding window to detect local minima
• Extracted candidates are added to a candidate set
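The extraction step can be illustrated with a brute-force matrix profile followed by a sliding-window search for local minima. This is a sketch under assumptions: it uses an O(n^2) loop rather than STOMP, the exclusion zone for trivial matches (start positions less than W apart) is assumed, and the `window` argument is a free parameter because the window length is not recoverable from the slide text.

```python
import numpy as np

def matrix_profile(series, w):
    """Brute-force matrix profile: z-normalized ED to the nearest non-trivial match."""
    series = np.asarray(series, dtype=float)
    n = len(series) - w + 1
    subs = np.array([series[i:i + w] for i in range(n)])
    sd = subs.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0                            # constant subsequences become all zeros
    subs = (subs - subs.mean(axis=1, keepdims=True)) / sd
    profile = np.full(n, np.inf)
    for i in range(n):
        dist = np.linalg.norm(subs - subs[i], axis=1)
        dist[max(0, i - w + 1):i + w] = np.inf   # assumed exclusion zone |i - j| < w
        profile[i] = dist.min()
    return profile

def local_minima(profile, window):
    """Indices whose value is the minimum of the window centered on them.

    Ties inside a window are all reported.
    """
    half = window // 2
    out = []
    for i, p in enumerate(profile):
        lo, hi = max(0, i - half), min(len(profile), i + half + 1)
        if np.isfinite(p) and p == profile[lo:hi].min():
            out.append(i)
    return out
```

A typical call would be `candidates = local_minima(matrix_profile(x, w), window)`, where the returned start positions form the candidate set.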
[Figure: time series X and its matrix profile P (closest-pair distances computed with STOMP); a local minimum of P inside a sliding window marks a candidate]
12. Step 2. Refine candidates into K motifs
• Refine the candidate set into a motif set
• Cost function based on the hub motif definition
• Find the motif set that minimizes the cost function in a greedy manner
• New candidates are added to the motif set one by one
• If the motif set then holds more than K motifs, the least significant one is removed
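A greedy add-then-prune loop of this kind could be sketched as below. The cost function here is an assumption made only for illustration (the total distance from each candidate to its nearest selected motif), since the slide's formula is not available in this transcript; the function names are hypothetical and the real HubFinder cost may differ.

```python
import numpy as np

def cost(d, motifs):
    """Assumed cost: total distance from each candidate to its nearest selected motif."""
    return d[:, motifs].min(axis=1).sum()

def refine(d, k):
    """Greedy refinement over a candidate-by-candidate distance matrix d.

    Candidates are added one at a time. Whenever more than k are held,
    the one whose removal leaves the lowest cost is dropped.
    """
    d = np.asarray(d, dtype=float)
    motifs = []
    for c in range(len(d)):
        motifs.append(c)                         # add the new candidate
        if len(motifs) > k:
            costs = [cost(d, motifs[:j] + motifs[j + 1:])
                     for j in range(len(motifs))]
            motifs.pop(int(np.argmin(costs)))    # drop the least significant one
    return motifs

# Toy example: six candidates forming two groups on a line.
pts = np.array([0.0, 0.5, 1.0, 10.0, 10.5, 11.0])
d = np.abs(pts[:, None] - pts[None, :])
selected = refine(d, 2)
```

On this toy input the loop keeps the central point of each group, which matches the intuition of a hub surrounded by its neighbors.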
[Figure: the candidate set and the motif set refined from it]
14. Synthetic Data
Two motifs of length W=32 are arranged alternately
• Motif-1: z-normalized triangular wave + Gaussian noise
• Motif-2: z-normalized sine wave + Gaussian noise
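A series of this kind could be generated along the following lines. Only the two shapes, W=32, and the 50 repetitions come from the slide; the filler between occurrences, its length `gap`, the noise level, and the seed are assumptions of this sketch, so the result has 9600 samples rather than the reported T=9616.

```python
import numpy as np

def znorm(x):
    """Z-normalize a non-constant array."""
    return (x - x.mean()) / x.std()

def make_synthetic(w=32, repeats=50, gap=64, noise=0.1, seed=0):
    """Alternate noisy triangular and sine motifs, separated by random filler.

    Returns the series and a list of (start position, motif id) labels.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(w)
    triangle = znorm(1.0 - np.abs(2.0 * t / (w - 1) - 1.0))   # one triangular peak
    sine = znorm(np.sin(2.0 * np.pi * t / w))                 # one sine period
    pieces, labels, pos = [], [], 0
    for r in range(2 * repeats):
        pieces.append(rng.normal(0.0, 1.0, gap))              # assumed filler segment
        pos += gap
        motif_id = 1 if r % 2 == 0 else 2
        shape = triangle if motif_id == 1 else sine
        pieces.append(shape + rng.normal(0.0, noise, w))
        labels.append((pos, motif_id))
        pos += w
    return np.concatenate(pieces), labels

x, labels = make_synthetic()
```

The `labels` list records where each motif was planted, which is the ground truth needed to score an enumeration result.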
[Figure: Motif-1 and Motif-2 alternate, repeated 50 times (T=9616)]
15. Synthetic Data (Result)
Apply ScanMK, SetFinder, and HubFinder with W=32
• HubFinder succeeds in finding the alternating motifs perfectly without tuning R
• Existing methods fail no matter how finely R is tuned
• Existing methods are sensitive to R
[Figure: purity (the larger, the better) versus radius R. The existing methods peak at 58.5% (R=0.96) and 69.0% (R=0.86); HubFinder is constant at 100%. Also shown: the extracted 1st and 2nd motifs and their neighbors]
16. Human Motion Data
MotionSense Dataset [Malekzadeh+18]
• Collected with an iPhone 6s kept in the participant's front pocket
• Includes 3D accelerometer time series of human motion
• A total of 24 participants performed several activities
• 4 activities were chosen for this study
[Figure: x, y, and z accelerometer traces for the four activities: Downstairs, Upstairs, Walking, and Jogging]
17. Human Motion Data (Result)
Apply ScanMK, SetFinder, and HubFinder with W=64
• Positions of the top-4 motifs and their neighbors for participant #23
• Existing methods fail to find a motif in the Downstairs activity
• HubFinder successfully finds motifs in all 4 activities
[Figure: motif and neighbor positions across Downstairs, Upstairs, Walking, and Jogging for ScanMK, SetFinder, and HubFinder]
19. Summary
• Problems in existing motif enumeration methods caused by R
• Novel hub motif definition and HubFinder algorithm
• HubFinder succeeds in finding appropriate motifs without tuning R
• Existing methods fail no matter how finely R is tuned
Future Work
• Remove the motif length parameter W
(Extend to variable-length motifs)
• Utilize extracted motifs for other time series mining tasks
such as classification, forecasting, segmentation, and anomaly detection
Python code is available at
https://github.com/intellygenta/HubFinder
Thank you for your attention!!
20. References
[Bagnall+14] A. Bagnall, J. Hills, and J. Lines, “Finding Motif Sets in Time Series,” arXiv:1407.3685 (2014).
[Zhu+16] Y. Zhu, Z. Zimmerman, N. S. Senobari, C. C. M. Yeh, G. Funning, A. Mueen, P. Brisk, and E. Keogh, “Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins,” IEEE 16th International Conference on Data Mining (ICDM), 739–748 (2016).
[Malekzadeh+18] M. Malekzadeh, R. G. Clegg, A. Cavallaro, and H. Haddadi, “Protecting sensory data against sensitive inferences,” Workshop on Privacy by Design in Distributed Systems (2018).
21. Purity
• Motif enumeration is to some extent similar to clustering
• Find representative patterns within a dataset in an unsupervised manner
• We adopt purity as the evaluation metric for this study
• One of the most popular metrics for clustering
• $\mathrm{Purity} = \frac{1}{\sum_j |\psi_j|} \sum_k \max_j |\omega_k \cap \psi_j|$
• Ground truth motif clusters: $\psi_1, \psi_2, \ldots$
• Enumerated motif clusters: $\omega_1, \omega_2, \ldots$
• E.g. Purity = (5 + 1) / (5 + 5) = 0.60
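The arithmetic of the example, (5 + 1) / (5 + 5) = 0.60, can be reproduced with a small helper. This is a sketch with hypothetical names: each enumerated cluster is given as the list of ground-truth labels of its members, and the membership used below is an illustrative choice that matches the numbers in the example rather than data read from the slide's figure.

```python
from collections import Counter

def purity(enumerated, n_ground_truth):
    """Purity of enumerated clusters against ground-truth labels.

    enumerated:      list of clusters, each a list of ground-truth labels
    n_ground_truth:  total number of ground-truth motif occurrences
    """
    # For each enumerated cluster, count the members of its majority label.
    hits = sum(max(Counter(cluster).values()) for cluster in enumerated if cluster)
    return hits / n_ground_truth

# Two ground-truth clusters of 5 occurrences each (labels 1 and 2).
omega_1 = [1, 1, 1, 1, 1, 2]   # majority label 1 covers 5 members
omega_2 = [2]                  # majority label 2 covers 1 member
score = purity([omega_1, omega_2], n_ground_truth=10)
```

Dividing by the number of ground-truth occurrences, rather than by the number of enumerated members, means that occurrences missed by every enumerated cluster lower the score.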
[Figure: ground-truth labels (ψ1, ψ2) and enumerated labels (ω1, ω2) assigned to each motif occurrence]
22. Time Complexity
• Running times on the synthetic time series
• HubFinder is faster than the existing methods for long time series
• HubFinder does not need multiple trials for tuning R
23. Dependency of the Number of Motifs K
Dependence of the purity on the number of motifs K for the synthetic data
• Blue and orange lines: ScanMK and SetFinder for their optimal radius R
• Gray lines: Existing methods with non-optimal radii
• HubFinder outperforms existing methods for all K and R
24. Human Motion Data (Result for All Participants)
HubFinder outperforms existing methods in terms of the purity metric
[Figure: HubFinder purity (without tuning R) versus ScanMK/SetFinder purity (with the best radius R)]