We consider the problem of finding anomalies in high-dimensional data using popular PCA based anomaly scores. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix which uses space quadratic in the dimensionality of the data. We give the first streaming algorithms
that use space that is linear or sublinear in the dimension. We prove general results showing that any sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate over real-world data sets. Our main technical contribution is to prove matrix perturbation
inequalities for operators arising in the computation of these measures.
-Proceedings: https://arxiv.org/abs/1804.03065
Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...Analytics India Magazine
We have come to rely on decisions made by deep learning models; however, why and how they work remain big questions for most of us. This talk tries to open that black box of deep learning, which is essential for building trust and enabling widespread adoption. The speaker addresses the importance of feature visualization and localization in deep learning models, especially convolutional neural networks, and shares the results of applying methods such as activation maps, deconvolution, and Grad-CAM in healthcare.
A STUDY AND ANALYSIS OF DIFFERENT EDGE DETECTION TECHNIQUES (cscpconf)
In the first study [1], a combination of K-means clustering, the watershed segmentation method, and a Difference In Strength (DIS) map was used to perform image segmentation and edge detection. We obtained an initial segmentation based on K-means clustering. Starting from this, we used two techniques: the first is the watershed technique with new merging procedures, based on mean intensity value, to segment the image regions and detect their boundaries; the second is an edge-strength technique that obtains accurate edge maps of our images without using the watershed method. With this technique we solved the problem of the undesirable over-segmentation that the watershed algorithm produces when applied directly to raw image data, and the edge maps we obtained have no broken lines across the entire image. In the second study, level-set methods are used to implement curve/interface evolution under various forces. In the third study, the main idea is to detect region (object) boundaries and to isolate and extract individual components from a medical image. This is done using active contours to detect regions in a given image, based on techniques of curve evolution, the Mumford–Shah functional for segmentation, and level sets. We first classify the image into different intensity regions based on a Markov Random Field; we then detect regions whose boundaries are not necessarily defined by the gradient by minimizing an energy of the Mumford–Shah functional for segmentation. In the level-set formulation, the problem becomes a mean-curvature flow that stops on the desired boundary. The stopping term does not depend on the gradient of the image, as it does in classical active contours. The initial level-set curve can be placed anywhere in the image, and interior contours are detected automatically. The final segmentation is one closed boundary per actual region in the image.
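As a rough illustration of the first stage of this pipeline, here is a numpy-only sketch of a K-means segmentation of pixel intensities followed by boundary extraction. It is a simplified stand-in under my own assumptions, not the study's code: the watershed and DIS steps are not reproduced, and the toy image is invented.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Cluster scalar intensities into k groups (plain Lloyd's algorithm,
    initialized deterministically at intensity quantiles)."""
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels, centers

def region_boundaries(label_img):
    """Mark pixels whose right or lower neighbour has a different label."""
    edges = np.zeros_like(label_img, dtype=bool)
    edges[:, :-1] |= label_img[:, :-1] != label_img[:, 1:]
    edges[:-1, :] |= label_img[:-1, :] != label_img[1:, :]
    return edges

# Toy image: dark left half, bright right half.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
labels, _ = kmeans_1d(img.ravel(), k=2)
edges = region_boundaries(labels.reshape(img.shape))
# The detected boundary is the vertical line between columns 3 and 4.
```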
An evaluation of two popular segmentation algorithms: the mean shift-based segmentation algorithm and a graph-based segmentation scheme. We also consider a hybrid method that combines the other two.
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION (cscpconf)
Texture is the term used to characterize the surface of a given object or phenomenon and is an important feature in image processing and pattern recognition. Our aim is to compare various texture-analysis methods based on time complexity and classification accuracy. The project describes texture classification using the Wavelet Transform and the Co-occurrence Matrix: features of a sample texture are compared against a database of different textures. For the wavelet transform we use the Haar, Symlets, and Daubechies wavelets. We find that the Haar wavelet proves to be the most efficient method in terms of the performance-assessment parameters mentioned above. A comparison of the Haar wavelet and the co-occurrence matrix method of classification also comes out in favor of Haar: although the time requirement of the latter method is high, it gives excellent classification accuracy except when the image is rotated.
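To make the wavelet-feature idea concrete, here is a minimal single-level 2-D Haar transform and the sub-band energy features it yields, in plain numpy. This is a simplified sketch of the kind of descriptor the study compares (the actual pipeline, wavelet families, and textures differ; the toy textures below are invented):

```python
import numpy as np

def haar2d(img):
    """One level of the 2-D Haar transform (image sides must be even)."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # vertical averages
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # vertical differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0      # approximation sub-band
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0      # horizontal-detail sub-band
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0      # vertical-detail sub-band
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0      # diagonal-detail sub-band
    return ll, lh, hl, hh

def texture_features(img):
    """Sub-band energies used as a simple texture descriptor."""
    return np.array([np.mean(b ** 2) for b in haar2d(img)])

# Two toy textures: vertical stripes vs. a flat patch.
stripes = np.tile([0.0, 1.0], (8, 4))
flat = np.full((8, 8), 0.5)
f_stripes = texture_features(stripes)   # high LH (horizontal-detail) energy
f_flat = texture_features(flat)         # all energy in the LL sub-band
```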
Convolutional neural networks (CNNs/ConvNets) are a core machine-learning algorithm in computer vision, used for image classification, object detection, digit recognition, and more. https://technoelearn.com
Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15750-15758).
Data-Driven Motion Estimation With Spatial Adaptation (CSCJournals)
The pel-recursive computation of 2-D optical flow raises a wealth of issues, such as the treatment of outliers, motion discontinuities, and occlusion. Our proposed approach deals with these issues within a common framework. It relies on a data-driven technique called Generalised Cross Validation to estimate the best regularisation scheme for a given pixel. In our model, the regularisation parameter is a general matrix whose entries can account for different sources of error. The motion-vector estimation takes local image properties into consideration, following a spatially adaptive approach in which each moving pixel is assumed to have its own regularisation matrix. Preliminary experiments indicate that this approach provides robust estimates of the optical flow.
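To illustrate the Generalised Cross Validation idea, here is a hedged numpy sketch of GCV for a scalar ridge (Tikhonov) parameter. The paper's regulariser is a full matrix chosen per pixel; this simplified scalar version only shows how GCV scores candidate regularisation strengths from the data alone:

```python
import numpy as np

def gcv_score(X, y, lam):
    """Generalised Cross Validation score for ridge parameter lam:
    GCV(lam) = n * ||(I - H) y||^2 / (n - trace(H))^2,
    where H = X (X^T X + lam I)^{-1} X^T is the hat matrix."""
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    resid = y - H @ y
    return n * np.sum(resid ** 2) / (n - np.trace(H)) ** 2

# Synthetic low-noise regression problem (invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + 0.1 * rng.normal(size=60)

# Pick the regularisation strength that minimises the GCV score.
lams = [10.0 ** p for p in range(-6, 3)]
best = min(lams, key=lambda l: gcv_score(X, y, l))
```

With low noise and a well-conditioned design, GCV selects a small regularisation strength, which matches the intuition that little smoothing is needed.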
Image segmentation techniques
More information on this research can be found in:
Hussein, Rania, and Frederic D. McKenzie. "Identifying Ambiguous Prostate Gland Contours from Histology Using Capsule Shape Information and Least Squares Curve Fitting." International Journal of Computer Assisted Radiology and Surgery (IJCARS), Vol. 2, No. 3-4, pp. 143-150, December 2007.
WAVELET BASED AUTHENTICATION/SECRET TRANSMISSION THROUGH IMAGE RESIZING (WA...) (sipij)
This paper presents a wavelet-based steganographic/watermarking technique in the frequency domain, termed WASTIR, for secret message/image transmission and image authentication. Number-system conversion of the secret image, changing the radix from decimal to quaternary, is the pre-processing step of the technique. The cover image is scaled through an inverse discrete wavelet transformation, and the false horizontal and vertical coefficients are embedded with the quaternary digits through a hash function and a secret key. Experimental results are computed and compared with existing steganographic techniques such as WTSIC, Yuancheng Li's method, and a region-based method in terms of Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR), and Image Fidelity (IF), which show better performance for WASTIR.
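The three quality metrics used in the comparison are standard and easy to state in code. A small numpy sketch follows; the cover/stego pair is synthetic (±1 embedding noise is an assumption for illustration, not the paper's embedding):

```python
import numpy as np

def mse(cover, stego):
    """Mean Square Error between cover and stego images."""
    return np.mean((cover.astype(float) - stego.astype(float)) ** 2)

def psnr(cover, stego, peak=255.0):
    """Peak Signal to Noise Ratio in dB (infinite for identical images)."""
    m = mse(cover, stego)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)

def image_fidelity(cover, stego):
    """IF = 1 - sum((cover - stego)^2) / sum(cover^2)."""
    c = cover.astype(float)
    return 1.0 - np.sum((c - stego.astype(float)) ** 2) / np.sum(c ** 2)

# Synthetic 8-bit cover image and a stego image differing by at most 1 level.
rng = np.random.default_rng(1)
cover = rng.integers(0, 256, size=(64, 64))
stego = np.clip(cover + rng.integers(-1, 2, size=cover.shape), 0, 255)
```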
We propose an efficient algorithmic framework for time-domain circuit simulation using exponential integrators. This work addresses several critical issues exposed by previous matrix-exponential-based circuit simulation research and makes it capable of simulating stiff nonlinear circuit systems at a large scale. In this framework, the system's nonlinearity is treated with an exponential Rosenbrock-Euler formulation. The matrix-exponential and vector product is computed using an inverted Krylov subspace method. Our proposed method has several distinct advantages over conventional formulations (e.g., the well-known backward Euler with Newton-Raphson method). The matrix factorization is performed only for the conductance/resistance matrix G, and not for the combinations of the capacitance/inductance matrix C and matrix G that are used in traditional implicit formulations. Furthermore, due to the explicit nature of our formulation, we do not need to repeat LU decompositions when adjusting the length of time steps for error control. Our algorithm is better suited to solving tightly coupled post-layout circuits in the pursuit of full-chip simulation. Our experimental results validate the advantages of our framework.
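As a toy illustration of the exponential-integrator idea (not the paper's Krylov-based implementation), the sketch below advances a small linear RC system C x' = -G x exactly with a matrix exponential, x(t+h) = exp(-h C^{-1} G) x(t). The 2x2 C and G values are made up, and the dense Taylor evaluation stands in for the paper's Krylov subspace method:

```python
import numpy as np

def expm_taylor(A, terms=20, squarings=8):
    """Matrix exponential via scaling-and-squaring with a Taylor series.
    A simple dense stand-in for Krylov-based exp(A) v evaluation."""
    B = A / (2 ** squarings)              # scale so the series converges fast
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ B / k               # B^k / k!
        E = E + term
    for _ in range(squarings):            # undo the scaling: E <- E^2
        E = E @ E
    return E

# Toy linear RC system (assumed values, for illustration only).
C = np.diag([1.0, 2.0])                        # capacitance matrix
G = np.array([[2.0, -1.0], [-1.0, 2.0]])       # conductance matrix
A = -np.linalg.solve(C, G)                     # x' = A x with A = -C^{-1} G
h = 0.1
x = np.array([1.0, 0.0])
x_next = expm_taylor(h * A) @ x                # one exact linear step
```

Because the step is explicit, changing the step size h only requires re-evaluating the exponential, with no refactorization of a combined C/G system.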
Anomaly detection using a deep one-class classifier (Hongbae Kim)
- Introduces various approaches to anomaly detection, and
- shows how Support Vector Data Description (SVDD) can be used to simplify the shape of a cluster so that it is easier to model, and
- introduces a method for handling ambiguous points near the cluster boundary.
Welcome to Supervised Machine Learning and Data Science: algorithms for building models, focusing on Support Vector Machines (SVMs), with an explanation of the classification algorithm and code in Python.
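As a concrete companion to the SVM material, here is a hedged, minimal linear SVM trained with the Pegasos stochastic subgradient method in plain numpy. This is not the course's code (which would more likely use scikit-learn's SVC); the toy data and the bias-free formulation are simplifying assumptions:

```python
import numpy as np

def pegasos_train(X, y, lam=0.01, epochs=50, seed=0):
    """Linear SVM via the Pegasos stochastic subgradient method.
    Labels y must be +1/-1; returns the weight vector (no bias, for brevity)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            if y[i] * (w @ X[i]) < 1.0:           # margin violated
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                                 # only shrink (regularize)
                w = (1 - eta * lam) * w
    return w

# Linearly separable toy data: the class is the sign of the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X[:, 0] += np.where(rng.random(200) < 0.5, 3.0, -3.0)
y = np.sign(X[:, 0])

w = pegasos_train(X, y)
acc = np.mean(np.sign(X @ w) == y)   # training accuracy on separable data
```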
Paper Study: Melding the data decision pipeline (ChenYiHuang5)
Melding the data decision pipeline: Decision-Focused Learning for Combinatorial Optimization, from AAAI 2019.
I derived the equations myself, applying the same derivation procedure, and obtained the same results as the two CMU papers mentioned [Donti et al. 2017; Amos et al. 2017].
Credit: Nusrat Jahan & Fahima Hossain, Dept. of CSE, JnU, Dhaka.
Randomized Algorithms (Advanced Algorithms): deterministic vs. non-deterministic algorithms, and Las Vegas and Monte Carlo algorithms.
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives in sequential mode (i.e., sumAt, multiply).
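The PageRank iteration that the report parallelises can be sketched sequentially in a few lines of Python. This is a generic power-iteration PageRank, not the report's OpenMP code; the tiny graph is invented:

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on an adjacency-list graph.
    Sequential sketch of the primitives an OpenMP version would parallelise."""
    n = len(adj)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        nxt = np.full(n, (1.0 - d) / n)
        # dangling nodes spread their rank uniformly
        nxt += d * sum(r[u] for u, nbrs in enumerate(adj) if not nbrs) / n
        for u, nbrs in enumerate(adj):
            if nbrs:
                share = d * r[u] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
        done = np.abs(nxt - r).sum() < tol
        r = nxt
        if done:
            break
    return r

# Tiny graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
ranks = pagerank([[1, 2], [2], [0]])
```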
Unleashing the Power of Data: Choosing a Trusted Analytics Platform (Enterprise Wired)
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computation and thus can also reduce iteration time. Road networks often contain chains that can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
[Internal study-session slides] Octo: An Open-Source Generalist Robot Policy
Efficient anomaly detection via matrix sketching
1. Efficient Anomaly Detection
via Matrix Sketching
32nd Conference on Neural Information Processing Systems
(NeurIPS 2018), Montréal, Canada.
Slides by: 謝幸娟 (Hsing-Chuan Hsieh)
connie0915549431@hotmail.com
Vatsal Sharan
Stanford University
vsharan@stanford.edu
Parikshit Gopalan
VMware Research
pgopalan@vmware.com
Udi Wieder
VMware Research
uwieder@vmware.com
3. 1. Introduction
• Context
Anomaly detection (AD) in high-dimensional numeric (streaming)
data is a ubiquitous problem in machine learning
• E.g., parameters regarding the health of machines in a data-center
Popular PCA/subspace-based AD methods are used
• Examples of anomaly scores: rank-k leverage scores, rank-k projection distance, …
4. 1. Introduction
• Computational challenge
The dimension of the data matrix A ∈ R^{n×d} may be very large
• E.g., the data center: d ≈ 10^6 and n ≫ d
The quadratic dependence on d renders the standard approach to AD inefficient in high dimensions
• Note: the standard approach to computing anomaly scores from the top k PCs
in a streaming setting:
1. Compute the covariance matrix A^T A ∈ R^{d×d}: space ≈ O(d^2); time ≈ O(nd^2)
2. Then compute the top k PCs for the anomaly scores
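That baseline is easy to state in code: accumulate A^T A one row at a time (O(d^2) space), then take the top-k eigenvectors, which is exactly the cost the paper sets out to avoid. The sketch below is my own minimal version; scoring the rows afterwards stands in for a second pass over the stream:

```python
import numpy as np

def rank_k_leverage_baseline(A, k):
    """O(d^2)-space baseline from the slide: pass 1 accumulates the
    covariance A^T A row by row; the top-k PCs then give the rank-k
    leverage scores (here via a second pass over the same array)."""
    n, d = A.shape
    cov = np.zeros((d, d))
    for a in A:                          # pass 1: O(d^2) space, O(nd^2) time
        cov += np.outer(a, a)
    evals, evecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = evecs[:, ::-1][:, :k]          # top-k principal directions v_j
    sig2 = evals[::-1][:k]               # sigma_j^2 for j = 1..k
    return ((A @ top) ** 2 / sig2).sum(axis=1)   # pass 2: L^k per row

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
scores = rank_k_leverage_baseline(A, k=3)
```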
5. 1. Introduction
• Requirements for an algorithm to be efficient
1. When n is too large:
the algorithm must work in a streaming fashion, where it only gets a
constant number of passes over the dataset stored in memory.
2. When d is too large:
the algorithm should ideally use memory linear in d (i.e., O(d)) or even
sublinear in d (e.g., O(log d))
6. 1. Introduction
• Purpose
1. Provide simple and practical algorithms at a significantly lower cost in
time and memory, i.e., from O(d^2) down to O(d) or O(log d),
using popular matrix sketching techniques: A ∈ R^{n×d} is compressed to
Ã ∈ R^{l×d} (l ≪ n) or to Ã ∈ R^{n×l} (l ≪ d),
where the sketch Ã preserves some desirable properties of the large matrix A
2. Prove that the estimated subspace-based anomaly scores computed from
Ã approximate the true anomaly scores
9. 2. Notation and Setup
• Subspace-based measures of anomalies for each instance a(i) ∈ R^d
Mahalanobis distance / leverage score:
L(i) = Σ_{j=1}^{d} (a(i)^T v_j / σ_j)^2
• L(i) is highly sensitive to the smaller singular values
Rank-k leverage score (k ≪ d):
L^k(i) = Σ_{j=1}^{k} (a(i)^T v_j)^2 / σ_j^2
• Real-world data sets often have most of their signal in the top singular values
Rank-k projection distance:
T^k(i) = Σ_{j=k+1}^{d} (a(i)^T v_j)^2
• Catches anomalies that are far from the principal subspace
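Both rank-k scores are direct to compute from a full SVD; this is the exact computation that the sketches later approximate. A minimal numpy sketch (the planted outlier is my own toy example, not the paper's data):

```python
import numpy as np

def subspace_scores(A, k):
    """Rank-k leverage score L^k(i) and projection distance T^k(i) per row,
    computed exactly from a full SVD of A."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    proj = A @ Vt[:k].T                                 # a(i)^T v_j for j <= k
    leverage = np.sum(proj ** 2 / S[:k] ** 2, axis=1)   # L^k
    row_norms = np.sum(A ** 2, axis=1)
    proj_dist = row_norms - np.sum(proj ** 2, axis=1)   # T^k: mass outside top-k
    return leverage, proj_dist

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
A[7] += 10.0 * rng.normal(size=8)   # plant one obvious outlier row
lev, T = subspace_scores(A, k=3)
```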
10. 2. Notation and Setup
• Subspace-based measures of anomalies for each instance a(i) ∈ R^d
[Figure] Illustration of the rank-k leverage score and projection distance
11. 2. Notation and Setup
• Assumptions
1. Separation assumption
In case of degeneracy (i.e., σ_k^2 = σ_{k+1}^2), L^k(i) and T^k(i) are
ill-defined, so we require a gap Δ between σ_k^2 and σ_{k+1}^2
(a (k, Δ)-separated matrix)
2. Approximate low-rank assumption
The top-k principal subspace captures a constant fraction (at least 0.1) of the total variance in
the data
• Assumption 2 captures the setting where the scores L^k and T^k are most meaningful.
• Our algorithms work best on data sets where relatively few principal components explain most of
the variance.
12. 3. Guarantees for anomaly detection via sketching
• Our main results say that given μ > 0 and a (k, Δ)-separated matrix A
∈ R^{n×d} with top singular value σ_1, any sketch Ã ∈ R^{l×d} satisfying
‖A^T A − Ã^T Ã‖ ≤ μ σ_1^2    (2)
or a sketch Ã ∈ R^{n×l} satisfying
‖A A^T − Ã Ã^T‖ ≤ μ σ_1^2    (3)
can be used to approximate the rank-k leverage scores and the projection
distance from the principal k-dimensional subspace.
• Explanation:
1. Equation (2): an approximation to the d×d covariance matrix of the row vectors
2. Equation (3): an approximation to the n×n covariance matrix of the column vectors
13. 3. Guarantees for anomaly detection via sketching
• Next, we show how to design efficient algorithms for finding
anomalies in a streaming fashion.
• Pointwise guarantees from row-space approximations
Algorithm 1: random column projection / row-space approximation (n → l)
• Sketching techniques: Frequent Directions sketch [18], row sampling [19], random
column projection, …
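Frequent Directions [18] is simple enough to sketch directly. Below is a hedged numpy implementation (the standard halving variant, not the authors' code), together with the operator-norm error it is designed to control, ‖A^T A − B^T B‖ ≤ 2‖A‖_F^2 / l:

```python
import numpy as np

def frequent_directions(A, ell):
    """Frequent Directions sketch: returns B (ell x d, requires ell <= d)
    with the deterministic guarantee ||A^T A - B^T B||_2 <= 2 ||A||_F^2 / ell."""
    n, d = A.shape
    B = np.zeros((ell, d))
    free = 0                                  # index of the next empty row
    for row in A:
        if free == ell:                       # sketch is full: shrink it
            _, S, Vt = np.linalg.svd(B, full_matrices=False)
            S2 = np.maximum(S ** 2 - S[ell // 2] ** 2, 0.0)
            B = np.sqrt(S2)[:, None] * Vt     # rows >= ell//2 become zero
            free = ell // 2
        B[free] = row
        free += 1
    return B

# Approximately low-rank data, where a small sketch captures the spectrum.
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 30)) \
    + 0.1 * rng.normal(size=(500, 30))
B = frequent_directions(A, ell=10)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)       # operator-norm sketch error
bound = 2.0 * np.linalg.norm(A) ** 2 / 10        # 2 ||A||_F^2 / ell
```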
14. 3. Guarantees for anomaly detection via sketching
• Pointwise guarantees from row-space approximations
• Theorem 1 improves on the trivial bounds, memory O(d^2) → O(d) and
time O(nd^2) → O(nd), since l is independent of d
15. 3. Guarantees for anomaly detection via sketching
• Next we show that substantial savings are unlikely for any algorithm
with strong pointwise guarantees:
there is an Ω(d) lower bound for any approximation
16. 3. Guarantees for anomaly detection via sketching
• Average-case guarantees from column-space approximations
Even though the sketch gives a column-space approximation (i.e., an
approximation of AA^T satisfying Eq. (3)),
our goal is still to compute the row anomaly scores from Ã^T Ã
• E.g., take a random matrix R ∈ R^{d×l} with r_ij i.i.d. uniform on {±1},
where l = O(k/μ^2) [27];
the sketch Ã = AR is then used to return the anomaly scores
Space consumption:
• For Ã^T Ã: O(l^2)
• With R chosen pseudorandomly [28-30]: O(log d)
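A hedged numpy illustration of this random ±1 projection sketch follows; the sizes are arbitrary, chosen only so that l ≪ d, and the error check is a numerical sanity test rather than the paper's formal guarantee:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, ell = 10, 2000, 500
A = rng.normal(size=(n, d))

# i.i.d. uniform {+1, -1} entries, scaled so that E[R R^T] = I_d.
R = rng.choice([-1.0, 1.0], size=(d, ell)) / np.sqrt(ell)
A_sk = A @ R                              # n x ell sketch, width independent of d

# Eq. (3): the column-space covariance A A^T is approximately preserved.
rel_err = (np.linalg.norm(A @ A.T - A_sk @ A_sk.T, 2)
           / np.linalg.norm(A @ A.T, 2))
```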
17. 3. Guarantees for anomaly detection via sketching
• Average-case guarantees from column-space approximations
• Algorithm 2: random row projection / column-space approximation
18. 3. Guarantees for anomaly detection via sketching
• Average-case guarantees from column-space approximations
• I.e., on average, random projections can preserve leverage scores and
distances from the principal subspace
• l = poly(k, ε^{-1}, Δ), independent of both n and d.
19. 4. Experimental evaluation
• Goal
1. To test whether our algorithms give results comparable to exact anomaly-score
computation based on a full SVD
2. To determine how large the parameter l (which determines the size of the sketch)
needs to be to get close to the exact scores
Note: the experiments are run in batch mode (i.e., not online)
20. 4. Experimental evaluation
• Data: (1) p53 mutants [32], (2) Dorothea [33], (3) RCV1 [34]
All available from the UCI Machine Learning Repository.
• Ground truth used to decide anomalies:
• We compute the rank-k anomaly scores using a full SVD, and then label the η
fraction of points with the highest anomaly scores as outliers.
1. k is chosen by examining the explained variance of the dataset: typically between 10 and 125
2. η is chosen by examining the histogram of the anomaly scores: typically between 0.01 and 0.1
21. 4. Experimental evaluation
• Experimental design
• 3 datasets × 2 algorithms × 2 anomaly scores
• Parameters (depending on the dataset):
1. k: 3 settings
2. l: 10 settings, ranging over (2k, 20k)
∴ 3 × 2 × 2 × 3 × 10 = 360 combinations,
iterated for 5 runs each
• Measuring accuracy
1. After computing anomaly scores for each dataset, we declare the
points with the top η′ fraction of scores to be anomalies (without knowing η)
2. Then compute the F1 score
Note: we choose the value of η′ that maximizes the F1 score
3. Report the average F1 score over the 5 runs
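The scoring protocol in steps 1-3 can be sketched as follows; the scores here are synthetic stand-ins for the exact and sketch-based scores, not the paper's data (scikit-learn's f1_score would serve the same purpose):

```python
import numpy as np

def f1_at_threshold(scores_exact, scores_approx, eta, eta_prime):
    """Label the top-eta fraction by the exact scores as ground-truth outliers,
    predict the top-eta_prime fraction by the approximate scores, return F1."""
    n = len(scores_exact)
    truth = np.zeros(n, dtype=bool)
    truth[np.argsort(scores_exact)[-int(eta * n):]] = True
    pred = np.zeros(n, dtype=bool)
    pred[np.argsort(scores_approx)[-int(eta_prime * n):]] = True
    tp = np.sum(truth & pred)
    if tp == 0:
        return 0.0
    prec = tp / pred.sum()
    rec = tp / truth.sum()
    return 2 * prec * rec / (prec + rec)

rng = np.random.default_rng(0)
exact = rng.random(1000)
noisy = exact + 0.001 * rng.normal(size=1000)   # a good approximation

# As in the protocol: take the eta' that maximizes the F1 score.
f1 = max(f1_at_threshold(exact, noisy, 0.05, e) for e in [0.01, 0.05, 0.1])
```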
25. 4. Experimental evaluation
• Takeaways:
1. Taking l = Ck with a fairly small C ≈ 10 suffices to get F1 scores >
0.75 in most settings
2. Algorithm 1 generally outperforms Algorithm 2 for a given value of l
26. 5. Conclusion
1. Any sketch Ã ∈ R^{l×d} of A with the property that ‖A^T A − Ã^T Ã‖ is
small can be used to additively approximate L^k and T^k for each
row. We get a streaming algorithm that uses O(d) memory and O(nd)
time.
2. Q: Can we get such an additive approximation using memory sublinear in d?
1. The answer is no.
2. Lower bound: any algorithm must use Ω(d) working space to approximate the outlier
scores for every data point
27. 5. Conclusion
3. Using random row projection, we give a streaming
algorithm that can preserve the outlier scores of the rows up to a
small additive error on average,
• and preserve most outliers. The space required by this algorithm is only
poly(k)·log(d).
4. In our experiments, we found that choosing l to be a small multiple
of k was sufficient to get good results:
1. Our sketch-based results are comparable to a full-blown SVD
2. with a significantly smaller memory footprint
3. and are faster to compute and easy to implement
28. • Instructor feedback
1. Code on GitHub?
2. Which SVD tool package?
3. How to apply this to online AD?