The document discusses techniques for creating small summaries of big data in order to improve computational scalability. It introduces sketch structures as a class of linear summaries that can be merged and updated efficiently. Specific sketch structures discussed include Bloom filters, Count-Min sketches, and Count sketches. It also covers counter-based summaries like the heavy hitters algorithm for finding frequent items in a data stream. The document outlines the structures, analysis, and applications of these various techniques for creating concise summaries of large datasets.
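To make the flavour of these sketch structures concrete, here is a minimal Bloom filter in Python. The class name, sizes, and salted SHA-1 hashing are illustrative assumptions rather than the exact construction from the slides; the point is the insert/query behaviour and the one-sided (false-positive-only) error.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted SHA-1 hashes over a bit array of size m."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k array positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # Never a false negative; false positives occur with probability
        # roughly (1 - e^(-k*n/m))^k after n insertions.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for word in ["alice", "bob", "carol"]:
    bf.add(word)
print("alice" in bf, "mallory" in bf)   # True, almost certainly False
```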
1) The document discusses algorithms for computing statistics like minimum, maximum, average over data streams using limited memory in a single pass. It covers algorithms for computing cardinality, heavy hitters, order statistics and histograms.
2) Cardinality can be estimated using the Flajolet-Martin algorithm, which hashes each item and records the observed bit positions in a small bitmap; the position of the lowest unset bit gives the estimate (a toy implementation appears after this list). Heavy hitters can be found using the Count-Min sketch. Order statistics such as the median can be approximated using the Frugal and T-Digest algorithms, and wavelet-based approaches can be used to compute histograms over data streams.
3) The document provides high-level explanations of these streaming algorithms along with references for further reading, but does not
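As a rough sketch of the Flajolet-Martin idea from point 2 above, the snippet below maintains a single bitmap of observed trailing-zero counts and reads the estimate off the lowest unset bit. The hash choice, bit width, and use of only one bitmap (real deployments average over many bitmaps) are assumptions made for illustration.

```python
import hashlib

def trailing_zeros(x, bits=32):
    """Index of the least-significant set bit (number of trailing zero bits)."""
    return (x & -x).bit_length() - 1 if x else bits

def fm_estimate(items, bits=32):
    """Single Flajolet-Martin bitmap; a very rough distinct-count estimate."""
    bitmap = 0
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & ((1 << bits) - 1)
        bitmap |= 1 << trailing_zeros(h, bits)
    r = 0
    while (bitmap >> r) & 1:      # position of the lowest unset bit
        r += 1
    return 2 ** r / 0.77351       # correction constant from the FM paper

stream = [i % 500 for i in range(10_000)]   # 500 distinct values, each repeated
print(round(fm_estimate(stream)))           # in the right ballpark of 500
```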
The document discusses an algorithms analysis and design course. The major objectives are to design and analyze modern algorithms, compare their efficiencies, and solve real-world problems. Students will learn to prove algorithm correctness, analyze running times, and apply techniques like dynamic programming and graph algorithms. Although algorithms can differ greatly in efficiency, and faster hardware does not change that asymptotically, the computational model used makes reasonable assumptions that allow algorithms to be compared asymptotically.
Principal component analysis - application in finance (Igor Hlivka)
Principal component analysis is a useful multivariate time series method for examining the drivers of change across an entire dataset. The main advantage of PCA is dimensionality reduction: a large set of data is transformed into a few principal factors that explain the majority of the variability in that group. PCA has found many applications in finance, both in risk and in yield curve analytics.
Streaming algorithms aim to process massive data streams with very limited memory. They can approximate calculations like minimum, maximum, average and quantiles in one pass over the data using hashing, sketching and other techniques. Key algorithms discussed include HyperLogLog for cardinality, Flajolet-Martin sketch for distinct items, Count-Min sketch for heavy hitters, and Frugal streaming for order statistics like median and quantiles using only logarithmic memory. T-Digest is also summarized as a way to find quantiles in streaming data using a balanced binary tree of centroids.
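The Frugal streaming idea mentioned above is simple enough to show in full. This is a hedged sketch of the one-unit-of-memory median estimator (often called Frugal-1U); the function name and test data are invented for the example, and published variants add randomization to target arbitrary quantiles.

```python
import random

def frugal_median(stream):
    """Keep one number and nudge it toward the running median by +/-1 per item."""
    estimate = 0
    for x in stream:
        if x > estimate:
            estimate += 1
        elif x < estimate:
            estimate -= 1
    return estimate

random.seed(0)
data = (random.gauss(100, 15) for _ in range(100_000))
print(frugal_median(data))   # drifts to roughly 100, the true median
```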
Using information theory principles to schedule real-time tasks (azm13)
This document presents a new scheduling algorithm called ITS-RT that uses information theory principles to schedule real-time tasks. ITS-RT selects the task with the highest amount of information per studied interval to schedule. It is shown to have slightly better performance than EDF in terms of average number of context switches and preemptions for the task sets studied. The document defines information theory concepts used in ITS-RT like probability, entropy, and information of tasks. It also presents the design, feasibility analysis, and performance comparison of ITS-RT against EDF.
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of data by transforming it to a new coordinate system. It works by finding the principal components - linear combinations of variables with the highest variance - and using those to project the data to a lower dimensional space. PCA is useful for visualizing high-dimensional data, reducing dimensions without much loss of information, and finding patterns. It involves calculating the covariance matrix and solving the eigenvalue problem to determine the principal components.
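The covariance-plus-eigendecomposition recipe described above fits in a few lines of NumPy. This is a minimal sketch rather than a production implementation; the function name and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]       # sort components by explained variance
    components = eigvecs[:, order[:k]]
    return Xc @ components, components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
Z, components = pca(X, k=2)
print(Z.shape, components.shape)   # (200, 2) (5, 2)
```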
Principal Component Analysis and Clustering (Usha Vijay)
The goal is to identify borrower segments from a bank data set with 27,000 rows and 77 variables using PROC PRINCOMP. With that many variables, it is important to reduce the data to a smaller set of derived variables before drawing a feasible conclusion, and because of multicollinearity two or more variables can share the same plane in those dimensions. Each row of the data can be viewed as a point in a 77-dimensional space, and when the data are projected onto orthonormal axes, certain characteristics are expected to cluster together as principal components. To identify these principal components, PROC PRINCOMP is executed with all variables except the constant ones (recoveries and collection fees), and a plot of the eigenvalues of all the principal components is derived.
IRJET - Different Data Mining Techniques for Weather Prediction (IRJET Journal)
This document discusses different data mining techniques that can be used for weather prediction, including back propagation, decision trees, k-means clustering, expectation maximization, and numerical and statistical methods. It provides an overview of each technique, explaining the basic process or algorithm involved. For example, it explains that back propagation is a deep learning algorithm that trains multilayer neural networks in two phases - propagation and weight updating. It also discusses how decision trees use rules to classify weather data based on input parameters, and how k-means clustering groups similar weather observations into clusters. The document aims to compare these techniques for applying data mining to weather forecasting.
Principal Component Analysis (PCA) understanding document (Naveen Kumar)
PCA is applied to reduce a dataset into fewer dimensions while retaining most of the variation in the data. It works by calculating the covariance matrix of the data and extracting eigenvectors with the highest eigenvalues, which become the principal components. The EJML Java library can be used to perform PCA by adding sample data, computing the basis using eigenvectors, and projecting samples into the reduced eigenvector space. PCA is generally not useful for datasets containing mostly 0s and 1s, as such sparse data is already in a compact format.
An overview of streaming algorithms: what they are, what the general principles regarding them are, and how they fit into a big data architecture. Also four specific examples of streaming algorithms and use-cases.
PPT on Analysis Of Algorithms.
The PPT includes algorithms, notations, analysis, analysis of algorithms, theta notation, big-oh notation, omega notation, and notation graphs.
This document provides an overview of Six Sigma methodology. It discusses that Six Sigma aims to reduce defects to 3.4 per million opportunities by using statistical methods. The Six Sigma methodology uses the DMAIC process which stands for Define, Measure, Analyze, Improve, and Control. It also outlines several statistical tools used in Six Sigma like check sheets, Pareto charts, histograms, scatter diagrams, and control charts. Process capability and its measures like Cp, Cpk are also defined. The document aims to explain the key concepts and tools used in Six Sigma to improve quality and processes.
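The capability indices mentioned above reduce to two formulas, Cp = (USL - LSL) / 6σ and Cpk = min(USL - μ, μ - LSL) / 3σ. A small worked example in Python follows; the shaft-diameter measurements and spec limits are hypothetical numbers chosen only to illustrate the calculation.

```python
def process_capability(samples, lsl, usl):
    """Cp and Cpk from a sample; assumes an approximately normal, in-control process."""
    n = len(samples)
    mean = sum(samples) / n
    sigma = (sum((x - mean) ** 2 for x in samples) / (n - 1)) ** 0.5  # sample std dev
    cp = (usl - lsl) / (6 * sigma)                   # potential capability
    cpk = min(usl - mean, mean - lsl) / (3 * sigma)  # capability accounting for centering
    return cp, cpk

# Hypothetical shaft diameters (mm) with spec limits 9.85 .. 10.15
data = [10.02, 9.98, 10.05, 9.97, 10.01, 10.03, 9.99, 10.00, 10.04, 9.96]
print(process_capability(data, lsl=9.85, usl=10.15))
```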
"FingerPrint Recognition Using Principle Component Analysis(PCA)”Er. Arpit Sharma
Fingerprint recognition is one of the oldest and most popular biometric technologies; it is used in criminal investigations as well as civilian and commercial applications. Fingerprint matching is the process used to determine whether two sets of fingerprint details come from the same finger. This work focuses on the feature extraction and minutiae matching stages. There are many matching techniques used in fingerprint recognition systems, such as minutiae-based matching, pattern-based matching, correlation-based matching, and image-based matching.
A new method based upon Principal Component Analysis (PCA) for fingerprint enhancement is proposed in this paper. PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and it is a common technique for finding patterns in high-dimensional data. In the proposed method, the image is first decomposed into directional images using a decimation-free directional filter bank (DDFB). PCA is then applied to these directional fingerprint images, which yields the PCA-filtered images, themselves essentially directional images. These directional images are then reconstructed into one image, which is the enhanced result. Simulation results are included illustrating the capability of the proposed method.
This document provides an overview of key mathematical concepts relevant to machine learning, including linear algebra (vectors, matrices, tensors), linear models and hyperplanes, dot and outer products, probability and statistics (distributions, samples vs populations), and resampling methods. It also discusses solving systems of linear equations and the statistical analysis of training data distributions.
This document discusses greedy algorithms and dynamic programming. It explains that greedy algorithms find local optimal solutions at each step, while dynamic programming finds global optimal solutions by considering all possibilities. The document also provides examples of problems solved using each approach, such as Prim's algorithm and Dijkstra's algorithm for greedy, and knapsack problems for dynamic programming. It then discusses the matrix chain multiplication problem in detail to illustrate how a dynamic programming solution works by breaking the problem into overlapping subproblems.
This document provides an overview of demand forecasting and inventory prediction techniques. It discusses the importance of accurate forecasting to ensure sufficient inventory levels. Key elements for successful forecasting include historical data on inventory levels, orders, trends, seasonality, and expected demand. Common forecasting models are explained, including simple exponential smoothing, Holt's linear trend method, and Holt-Winters seasonal method. The document also covers concepts like stationarity, differencing time series data to make it stationary, and using autoregressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA) models to forecast time series with trends or seasonal patterns. Homework is assigned to further experiment with transforming time series to achieve stationarity
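Simple exponential smoothing and Holt's linear trend method can be sketched directly from their update rules (level l_t = α·y_t + (1-α)·l_{t-1}, plus a smoothed trend term in Holt's case). The demand numbers and smoothing constants below are made up for the example; this is not the document's own code.

```python
def simple_exponential_smoothing(series, alpha):
    """Level-only model: the one-step forecast is the last smoothed level."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

def holt_linear(series, alpha, beta, horizon=1):
    """Holt's method: smooth a level and a trend, forecast level + horizon * trend."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend

demand = [120, 132, 128, 140, 135, 150, 148, 160]
print(round(simple_exponential_smoothing(demand, alpha=0.3), 1))
print(round(holt_linear(demand, alpha=0.3, beta=0.2), 1))
```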
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter 8 (Hakky St)
This is the documentation for a study meeting in the lab. The book is "Hands-On Machine Learning with Scikit-Learn and TensorFlow", and this covers Chapter 8.
This document discusses the divide and conquer algorithm design strategy and provides an analysis of the merge sort algorithm as an example. It begins by explaining the divide and conquer strategy of dividing a problem into smaller subproblems, solving those subproblems recursively, and combining the solutions. It then provides pseudocode and explanations for the merge sort algorithm, which divides an array in half, recursively sorts the halves, and then merges the sorted halves back together. It analyzes the time complexity of merge sort as Θ(n log n), proving it is more efficient than insertion sort.
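A compact Python version of the merge sort just described, showing the divide, recursive sort, and merge steps; this is a readable sketch rather than an optimized in-place implementation.

```python
def merge_sort(a):
    """Divide, recursively sort the halves, then merge: Theta(n log n) overall."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # merge step is linear in n
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 7, 3]))   # [1, 2, 3, 5, 7, 9]
```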
External sorting algorithms are needed to sort data that is too large to fit into main memory. Traditional sorting algorithms require the entire input to reside in memory. External sorting techniques divide the large input into chunks that can fit in memory, sort each chunk, and then merge the sorted chunks. The document discusses how external accesses to data on disk are much slower than accessing data in memory, necessitating new algorithms to minimize disk I/O for large datasets that cannot fit in memory.
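A minimal external merge sort along the lines described above: split the input into memory-sized runs, sort each run, spill it to disk, then k-way merge the runs. The one-integer-per-line file format, chunk size, and file names are assumptions for the example.

```python
import contextlib
import heapq
import os
import random
import tempfile

def external_sort(input_path, output_path, chunk_lines=1_000):
    """Sort a large text file of integers (one per line) without holding it all in memory."""
    run_paths = []
    with open(input_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()                               # in-memory sort of one run
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{x}\n" for x in chunk)
            run_paths.append(path)
    with open(output_path, "w") as out, contextlib.ExitStack() as stack:
        runs = [map(int, stack.enter_context(open(p))) for p in run_paths]
        out.writelines(f"{x}\n" for x in heapq.merge(*runs))   # k-way merge via a heap
    for p in run_paths:
        os.remove(p)

# Tiny demo: 10,000 shuffled integers sorted in runs of 1,000.
nums = list(range(10_000)); random.shuffle(nums)
with open("numbers.txt", "w") as f:
    f.writelines(f"{x}\n" for x in nums)
external_sort("numbers.txt", "sorted.txt")
```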
1. Distributed optimization techniques are needed to train machine learning models on large datasets.
2. Gradient descent and its variants are commonly used optimization methods for training ML models. These include batch gradient descent, stochastic gradient descent, momentum gradient descent, Nesterov accelerated gradient, Adagrad, Adadelta, and RMSprop.
3. Each method has a different approach to updating model parameters in order to minimize an objective function more efficiently. For example, momentum helps overcome oscillations, while Adagrad adapts the learning rate for each parameter; a toy comparison of these update rules follows below.
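A toy comparison of three of the update rules listed above (plain gradient descent, momentum, and Adagrad) on an ill-conditioned quadratic; the hyperparameters and test function are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

def compare_updates(grad_fn, x0, steps=200, lr=0.1):
    """Run plain GD, momentum, and Adagrad from the same start point."""
    x_gd = x_mom = x_ada = np.array(x0, dtype=float)
    velocity = np.zeros_like(x_gd)   # momentum buffer
    accum = np.zeros_like(x_gd)      # Adagrad accumulated squared gradients
    for _ in range(steps):
        g = grad_fn(x_gd); x_gd = x_gd - lr * g
        g = grad_fn(x_mom); velocity = 0.9 * velocity + g; x_mom = x_mom - lr * velocity
        g = grad_fn(x_ada); accum = accum + g ** 2; x_ada = x_ada - lr * g / (np.sqrt(accum) + 1e-8)
    return x_gd, x_mom, x_ada

# f(x) = 0.5 * (10 * x0^2 + x1^2), minimum at the origin; gradient is (10*x0, x1).
grad = lambda x: np.array([10 * x[0], x[1]])
print(compare_updates(grad, [1.0, 1.0]))
```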
Principal Component Analysis (PCA) is a technique used to simplify complex data sets by identifying patterns in the data and expressing it in such a way to highlight similarities and differences. It works by subtracting the mean from the data, calculating the covariance matrix, and determining the eigenvectors and eigenvalues to form a feature vector representing the data in a lower dimensional space. PCA can be used to represent image data as a one dimensional vector by stacking the pixel rows of an image and applying this analysis to multiple images.
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs (Jason Riedy)
Graph-structured data in network security, social networks, finance, and other applications not only are massive but also under continual evolution. The changes often are scattered across the graph, permitting novel parallel and incremental analysis algorithms. We discuss analysis algorithms for streaming graph data to maintain both local and global metrics with low latency and high efficiency.
PCA (Principal Component Analysis) is a technique used to simplify complex data sets by reducing their dimensionality. It transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. The document provides background on concepts like variance, covariance, and eigenvalues that are important to understanding PCA. It also includes an example of using PCA to analyze student data and identify the most important parameters to describe students.
The document discusses hash tables and collision resolution techniques for hash tables. It defines hash tables as an implementation of dictionaries that use hash functions to map keys to array slots. Collisions occur when multiple keys hash to the same slot. Open addressing techniques like linear probing and quadratic probing search the array sequentially for empty slots when collisions occur. Separate chaining creates an array of linked lists so items can be inserted into lists when collisions occur.
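A minimal separate-chaining hash table illustrating the collision handling described above; the class and method names are invented for the example (open addressing would instead probe the array itself for a free slot).

```python
class ChainedHashTable:
    """Dictionary via separate chaining: each slot holds a list of (key, value) pairs."""
    def __init__(self, slots=8):
        self.buckets = [[] for _ in range(slots)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]   # hash key to a slot

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # collision or empty slot: extend the chain

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

t = ChainedHashTable()
t.put("apple", 3); t.put("pear", 5); t.put("apple", 4)
print(t.get("apple"), t.get("plum"))   # 4 None
```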
This document discusses complexity analysis of algorithms. It defines an algorithm and lists properties like being correct, unambiguous, terminating, and simple. It describes common algorithm design techniques like divide and conquer, dynamic programming, greedy method, and backtracking. It compares divide and conquer with dynamic programming. It discusses algorithm analysis in terms of time and space complexity to predict resource usage and compare algorithms. It introduces asymptotic notations like Big-O notation to describe upper bounds of algorithms as input size increases.
This document provides an overview of the Design and Analysis of Algorithms course. It discusses the closest pair of points problem and provides a divide and conquer algorithm to solve it in O(n log^2 n) time. The algorithm works by recursively dividing the problem into subproblems on left and right halves, computing the closest pairs for each, and then combining results while searching a sorted array to handle point pairs across divisions. Homework includes improving the closest pair algorithm to O(n log n) time and considering a data structure for orthogonal range searching.
The document provides an introduction to the analysis of algorithms. It discusses key concepts like the definition of an algorithm, properties of algorithms, common computational problems, and basic issues related to algorithms. It also covers algorithm design strategies, fundamental data structures, and the fundamentals of analyzing algorithm efficiency. Examples of algorithms for computing the greatest common divisor and checking for prime numbers are provided to illustrate algorithm design and analysis.
Heuristic design of experiments with meta gradient search (Greg Makowski)
Once you have started learning about predictive algorithms, and the basic knowledge discovery in databases process, what is the next level of detail to learn for a consulting project?
* Give examples of the many model training parameters
* Track results in a "model notebook"
* Use a model metric that combines both accuracy and generalization to rank models
* How to strategically search over the model training parameters - use a gradient descent approach
* One way to describe an arbitrarily complex predictive system is by using sensitivity analysis
Counting with Prometheus (CloudNativeCon + KubeCon Europe 2017) (Brian Brazil)
Counters are one of the two core metric types in Prometheus, allowing for tracking of request rates, error ratios and other key measurements. Learn why they are designed the way they are, how client libraries implement them and how rate() works.
If you'd like more information about Prometheus, contact us at prometheus@robustperception.io
The document summarizes the Count-Min Sketch streaming algorithm. It uses a two-dimensional array and d independent hash functions to estimate item frequencies in a data stream using sublinear space. It works by incrementing the appropriate counters in each row when an item arrives. The estimated frequency of an item is the minimum value across the rows. Analysis shows that for an array width w proportional to 1/ε, the estimate will be within an additive error of ε times the total frequency with high probability.
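The description above translates almost directly into code. Below is a small Count-Min sketch in Python; salting Python's built-in hash stands in for the d pairwise-independent hash functions assumed by the analysis, and the width/depth values follow the usual w ≈ e/ε, d ≈ ln(1/δ) rule of thumb.

```python
import random

class CountMinSketch:
    """d rows of w counters, one hash function per row; estimates are overcounts only."""
    def __init__(self, width, depth, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(64) for _ in range(depth)]   # one salt per row

    def _column(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._column(r, item)] += count

    def estimate(self, item):
        # Each row overestimates (true count plus collisions), so take the minimum.
        return min(self.table[r][self._column(r, item)] for r in range(self.depth))

cms = CountMinSketch(width=272, depth=5)   # roughly eps = 0.01, delta = 0.01
for x in ["a"] * 1000 + ["b"] * 50 + list("cdefgh") * 10:
    cms.update(x)
print(cms.estimate("a"), cms.estimate("b"))   # close to 1000 and 50, never below
```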
This document discusses statistical computing for big data using distributed computing frameworks like MapReduce and Hadoop. It introduces MapReduce concepts like mappers, reducers, and Hadoop components including HDFS and YARN. Statistical challenges with big data are described, like scalability, dimensionality, and heterogeneity. The document discusses approaches for computing statistics on large datasets in parallel, including the Bag of Little Bootstraps method which breaks data into partitions to allow bootstrapping computations to run independently on clusters. Examples of computing means and counts in parallel using MapReduce are also provided.
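Computing a mean in parallel, MapReduce-style, only requires each partition to emit a partial (sum, count) pair that a reducer combines. Here is a minimal sketch using Python's multiprocessing in place of Hadoop; the partitioning scheme and pool size are arbitrary choices for the example.

```python
from multiprocessing import Pool

def map_partition(chunk):
    """Map phase: each partition emits a partial (sum, count) pair."""
    return sum(chunk), len(chunk)

def reduce_mean(partials):
    """Reduce phase: combine the partial sums and counts into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]         # split into 4 partitions
    with Pool(4) as pool:
        partials = pool.map(map_partition, chunks)  # "mappers" run in parallel
    print(reduce_mean(partials))                    # 499999.5
```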
This document provides an introduction to the analysis of algorithms. It defines an algorithm and lists key properties including being finite, definite, and able to produce the correct output for any valid input. Common computational problems and basic algorithm design strategies are outlined. Asymptotic notations for analyzing time and space efficiency are introduced. Examples of algorithms for calculating the greatest common divisor and determining if a number is prime are provided and analyzed. Fundamental data structures and techniques for analyzing recursive algorithms are also discussed.
The document provides an overview of learning Bayes networks from data. It discusses learning the structure and conditional probability tables (CPTs) of a Bayes network given training data. When the network structure is known, the CPTs can be directly estimated from sample statistics in the training data, handling both cases of complete and missing data using techniques like expectation-maximization. When the structure is unknown, scoring metrics like minimum description length are used to search the space of possible structures to find the best fitting network. Dynamic decision networks extend this framework to model sequential decision making problems.
The document discusses the analysis of algorithms, including time and space complexity analysis. It covers key aspects of analyzing algorithms such as determining the basic operation, input size, and analyzing best-case, worst-case, and average-case time complexities. Specific examples are provided, such as analyzing the space needed to store real numbers and analyzing the time complexity of sequential search. Order of growth and asymptotic analysis techniques like Big-O, Big-Omega, and Big-Theta notation are also explained.
This document discusses an upcoming lecture on linear regression and gradient descent. The lecture will cover gradient descent for linear regression, implementing gradient descent in code, and interpreting models from multiple linear regression. It will review cost functions and the intuition behind gradient descent, then demonstrate gradient descent for linear regression.
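A bare-bones batch gradient descent for linear regression, of the kind such a lecture typically walks through; the learning rate, epoch count, and synthetic data below are illustrative assumptions, not the lecture's own example.

```python
import numpy as np

def gradient_descent_linreg(X, y, lr=0.1, epochs=500):
    """Minimize the MSE cost J(w, b) = mean((Xw + b - y)^2) / 2 by batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        err = X @ w + b - y           # residuals
        w -= lr * (X.T @ err) / n     # dJ/dw
        b -= lr * err.mean()          # dJ/db
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.5]) + 0.7 + rng.normal(scale=0.1, size=200)
print(gradient_descent_linreg(X, y))   # weights near [3, -1.5], intercept near 0.7
```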
It is rather surprising that in software engineering, standard measurement units have yet to be widely accepted and used; every other engineering discipline has its own. By and large, effort is the most commonly used parameter for measuring software initiatives. The problem, of course, is that effort is not an independent variable: it depends on who is doing the work and how it is done. This presentation looks at an approach that has been used to convert the large amount of effort data usually collected in an organization into something that can meaningfully be used for estimation and comparison purposes.
Merge sort analysis and its real-time applications (yazad dumasia)
The document provides an analysis of the merge sort algorithm and its real-time applications. It begins with an introduction to sorting and different sorting techniques. Then it describes the merge sort algorithm, explaining the divide, conquer, and combine steps. It analyzes the time complexity of merge sort, showing that it has O(n log n) runtime. Finally, it discusses some real-time applications of merge sort, such as recommending similar products to users on e-commerce websites based on purchase history.
Divide and conquer is an algorithm design paradigm where a problem is broken into smaller subproblems, those subproblems are solved independently, and their results are combined to solve the original problem. Examples of algorithms that use this approach are merge sort, quicksort, and matrix multiplication algorithms like Strassen's algorithm. The greedy method works in stages, making locally optimal choices at each step in the hope of finding a global optimum; it is used for problems like job sequencing with deadlines and the knapsack problem. A minimum-cost spanning tree is a subgraph of a connected graph that includes all vertices while minimizing the total edge weight.
This document provides an overview of asymptotic analysis and Landau notation. It discusses justifying algorithm analysis mathematically rather than experimentally. Examples are given to show that two functions may appear different but have the same asymptotic growth rate. Landau symbols like O, Ω, o and Θ are introduced to describe asymptotic upper and lower bounds between functions. Big-Θ represents asymptotic equivalence between functions, meaning that a constant-factor speedup from faster hardware cannot make one asymptotically better than the other.
This document provides an overview of a lecture on designing and analyzing computer algorithms. It discusses key concepts like what an algorithm and program are, common algorithm design techniques like divide-and-conquer and greedy methods, and how to analyze algorithms' time and space complexity. The goals of analyzing algorithms are to understand their behavior, improve efficiency, and determine whether problems can be solved within a reasonable time frame.
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS (csandit)
The ability to automatically mine and extract useful information from large datasets has been a common concern for organizations over the last few decades. Data on the internet is growing steadily, and consequently the capacity to collect and store very large data is increasing significantly. Existing clustering algorithms are not always efficient and accurate when solving clustering problems for large datasets, and the development of accurate and fast data classification algorithms for very large-scale datasets is still a challenge. In this paper, various algorithms and techniques, in particular an approach using a non-smooth optimization formulation of the clustering problem, are proposed for solving minimum sum-of-squares clustering problems in very large datasets. The research also develops an accurate, real-time L2-DC algorithm based on an incremental approach to solve the minimum sum-of-squares clustering problem.
Similar to Ke Yi: Small Summaries for Big Data
Machine Status Prediction for Dynamic and Heterogeneous Cloud Environments (jins0618)
The widespread utilization of cloud computing services has brought the emergence of cloud service reliability as an important issue for both cloud providers and users. To enhance cloud service reliability and reduce the subsequent losses, the future status of virtual machines should be monitored in real time and predicted before they crash. However, most existing methods ignore two characteristics of actual cloud environments and therefore predict status poorly: 1. the cloud environment is dynamically changing; 2. the cloud environment consists of many heterogeneous physical and virtual machines. In this paper, we investigate the predictive power of data collected from the cloud environment and propose a simple yet general machine learning model, StaP, to predict multiple machine statuses. We introduce the motivation, model development, and optimization of the proposed StaP. The experimental results validate the effectiveness of the proposed StaP.
Latent Interest and Topic Mining on User-item Bipartite Networks (jins0618)
Latent Factor Model (LFM) is extensively used for dealing with user-item bipartite networks in service recommendation systems. To alleviate the limitations of LFM, this paper presents a novel unsupervised learning model, the Latent Interest and Topic Mining model (LITM), to automatically mine latent user interests and item topics from user-item bipartite networks. In particular, we introduce the motivation and objectives of this bipartite-network-based approach and detail the model development and optimization process of the proposed LITM. This work not only provides an efficient method for latent user interest and item topic mining, but also highlights a new way to improve the accuracy of service recommendation. Experimental studies are performed; the results validate LITM's efficiency in model training and demonstrate its ability to provide better service recommendation performance based on user-item bipartite networks.
Web Service QoS Prediction Approach in Mobile Internet Environments (jins0618)
Many existing Web service QoS prediction approaches are very accurate in Internet environments, but they cannot provide accurate prediction values in Mobile Internet environments, where the QoS values of Web services are highly volatile. In this paper, we propose an accurate Web service QoS prediction approach that weakens the volatility of QoS data from Web services in Mobile Internet environments. The approach contains three steps: QoS preprocessing, user similarity computing, and QoS predicting. We have implemented our proposed approach and run experiments on real-world and synthetic datasets. The results show that our approach outperforms other approaches in Mobile Internet environments.
This document outlines a course on mining heterogeneous information networks. The course will cover phrase mining and topic modeling from large text corpora, entity extraction and typing through relational graph construction and propagation, as well as mining and constructing heterogeneous information networks.
Christian Jensen: Advanced routing in spatial networks using big data (jins0618)
Advanced Routing in Spatial Networks Using Big Data discusses using big data and advanced routing techniques for transportation networks. It covers modeling transportation networks using big data from sensors to assign time-varying weights representing factors like travel time and emissions. It then discusses routing algorithms that find optimal routes considering these weights, including algorithms for stochastic and uncertain weights. The document provides an overview of using big data to improve transportation network modeling and routing.
This document discusses influenceability estimation in social networks. It describes the independent cascade model of influence diffusion, where each node has an independent probability of influencing its neighbors. The problem is to estimate the expected number of nodes reachable from a given seed node. The document presents the naive Monte Carlo (NMC) approach, which samples possible graphs and averages the number of reachable nodes over the samples. While NMC provides an unbiased estimator, it has high variance. The document aims to reduce the variance to improve estimation accuracy.
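The naive Monte Carlo estimator described above can be sketched directly: in each sampled "possible world" every edge is live independently with the given probability, and we count the nodes reachable from the seed. The toy graph, probability, and sample count below are hypothetical.

```python
import random

def estimate_influence(graph, seed, prob, samples=2000, rng_seed=0):
    """Naive Monte Carlo estimate of expected spread under the independent cascade model."""
    rng = random.Random(rng_seed)
    total = 0
    for _ in range(samples):
        visited, frontier = {seed}, [seed]
        while frontier:
            u = frontier.pop()
            for v in graph.get(u, []):
                if v not in visited and rng.random() < prob:   # edge is live in this world
                    visited.add(v)
                    frontier.append(v)
        total += len(visited)                                  # reachable set size
    return total / samples                                     # unbiased, but high variance

# Hypothetical toy graph as an adjacency list.
g = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
print(estimate_influence(g, seed=0, prob=0.5))
```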
Calton Pu: Experimental methods on performance in cloud and accuracy in big da... (jins0618)
Experimental Methods on Performance in Clouds provides an overview of cloud computing and benchmarks for measuring cloud performance. It discusses:
1) The evolution of cloud computing from early data centers to modern cloud platforms like Amazon Web Services.
2) Examples of cloud workloads and benchmarks used to test performance, including RUBiS for e-commerce and RUBBoS for bulletin boards.
3) The challenges of modeling and measuring cloud performance at scale due to the large number of variable configurations, and the need for automation through frameworks like Expertus.
This document discusses challenges and opportunities in parallel graph processing for big data. It describes how graphs are ubiquitous but processing large graphs at scale is difficult due to their huge size, complex correlations between data entities, and skewed distributions. Current computation models have problems with ghost vertices, too much interaction between partitions, and lack of support for iterative graph algorithms. New frameworks are needed to handle these graphs in a scalable way with low memory usage and balanced computation and communication.
This document discusses challenges in processing large graphs and introduces an approach called GraphLego. It describes how GraphLego models large graphs as 3D cubes partitioned into slices, strips and dices to balance parallel computation. GraphLego optimizes access locality by minimizing disk access and compressing partitions. It also uses regression-based learning to optimize partitioning parameters and runtime. The document evaluates GraphLego on real-world graphs, finding it outperforms existing single-machine graph processing systems in execution efficiency and partitioning decisions.
Wang Ke: Mining revenue-maximizing bundling configurations (jins0618)
This document presents algorithms for mining revenue-maximizing bundling configurations from consumer preference data. It discusses how willingness to pay for items can be estimated from online ratings data. The bundle configuration problem of grouping items into bundles to maximize total revenue is formulated and shown to be NP-hard for bundles of size 3 or more. Heuristic algorithms based on graph matching and greedy approaches are proposed to solve the problem approximately. The algorithms are evaluated on a real dataset of Amazon book ratings, demonstrating increased revenue from bundling over selling items individually.
Wang Ke: Classification by CUT clearance under threshold (jins0618)
This document proposes a new classification method called CUT (Classification Under Threshold) that partitions data into groups that are either "cleared" or "not cleared" based on a user-specified threshold. The goal is to maximize the number of future cases that can be cleared without intervention. It was tested on problems like predicting transformers with carcinogenic PCBs. Experimental results found CUT outperformed other methods by clearing more non-hazardous cases while keeping errors under the threshold.
The document summarizes an entity extraction and typing framework proposed by the author. The framework constructs a heterogeneous graph connecting entity mentions, surface names, and relation phrases extracted from documents. It then performs joint type propagation and relation phrase clustering on the graph to infer types for entity mentions. Evaluation on news, tweets and reviews shows the framework outperforms existing methods in recognizing new types and domains without extensive feature engineering or human supervision. It obtains improvements by modeling each mention individually and addressing data sparsity through relation phrase clustering.
Strategy 3 for topical phrase mining first performs phrase mining on a corpus to extract candidate phrases, then applies topic modeling with the phrases as constraints. This approach generates coherent topics where words within a phrase share a topic label. Strategy 3 outputs high-quality topics and phrases faster than Strategies 1 and 2, and the topics and phrases have better coherence, quality, and intrusion scores. However, Strategy 3's topic inference relies on the accuracy of the initial phrase mining.
This document discusses mining heterogeneous information networks. It begins by defining heterogeneous information networks as information networks containing multiple object and link types. It then discusses how heterogeneous networks are richer than homogeneous networks derived from them by projection. Several examples of heterogeneous networks are given, such as bibliographic, social media, and healthcare networks. The document outlines principles for mining heterogeneous networks, including using meta-paths to explore network structures and relationships. It introduces methods for ranking, clustering, and classifying nodes in heterogeneous networks, such as the RankClus and NetClus algorithms, which integrate ranking and clustering.
The document summarizes a talk on analyzing the truthfulness of web data. It discusses analyzing structured data from two domains - stock prices and flight statuses - from multiple online sources. It was found that there is a lot of redundant data across sources, but also significant inconsistencies. Various techniques were explored to resolve inconsistencies and find true values, such as voting methods that leverage source accuracy and ignore copied data. Detecting copying between sources is important for improving truth finding but also computationally challenging. Scaling up copy detection methods can help address this challenge.
Gao Cong: Geospatial social media data management and context-aware recommenda... (jins0618)
The document discusses geospatial social media data management and context-aware recommendation. It introduces technologies for geo-positioning users and content, and how user generated content from social media is increasingly associated with geo-locations. The document then outlines queries for static geo-textual data, publish/subscribe queries on geo-textual data streams, and personalized, context-aware point-of-interest recommendation based on modeling user behavior from geo-textual data.
Chengqi Zhang: Graph processing and mining in the era of big data (jins0618)
The document discusses challenges and opportunities in graph processing and mining for big data, including new graph semantics, mining tasks, query processing algorithms, indexing techniques, and computing models needed to handle large graph datasets. It outlines the speaker's work on topics such as structural keyword search, graph matching, community detection, and graph classification. It also covers the speaker's work on efficient graph algorithms, parallel and distributed graph computation, and graph processing system design.
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
2. The case for “Big Data” in one slide
“Big” data arises in many forms:
– Medical data: genetic sequences, time series
– Activity data: GPS location, social network activity
– Business data: customer behavior tracking at fine detail
– Physical Measurements: from science (physics, astronomy)
The 3 V’s:
– Volume
– Velocity
– Variety
3. Computational scalability
The first (prevailing) approach: scale up the computation
Many great technical ideas:
– Use many cheap commodity devices
– Accept and tolerate failure
– Move code to data
– MapReduce: BSP for programmers
– Break problem into many small pieces
– Decide which constraints to drop: noSQL, ACID, CAP
Scaling up comes with its disadvantages:
– Expensive (hardware, equipment, energy), still not always fast
This talk is not about this approach!
4. Downsizing data
A second approach to computational scalability:
scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Often randomized (small constant probability of error)
– Examples: samples, sketches, histograms, wavelet transforms
Complementary to the first approach: not a case of either-or
Some drawbacks:
– Not a general purpose approach: need to fit the problem
– Some computations don’t allow any useful summary
5. Outline for the talk
Some of the most important data summaries
– Sketches: Bloom filter, Count-Min, Count-Sketch (AMS)
– Counters: Heavy hitters, quantiles
– Sampling: simple samples, distinct sampling
– Summaries for more complex objects: graphs and matrices
Current trends and future challenges for data summarization
Many abbreviations and omissions (histograms, wavelets, ...)
A lot of work relevant to compact summaries
– In both the database and the algorithms literature
6. Summary Construction
There are several different models for summary construction
– Offline computation: e.g. sort data, take percentiles
– Streaming: summary merged with one new item each step
– Full mergeability: allow arbitrary merges of partial summaries
The most general and widely applicable category
Key methods for summaries:
– Create an empty summary
– Update with one new tuple: streaming processing
Insert and delete
– Merge summaries together: distributed processing (eg MapR)
– Query: may tolerate some approximation (parameterized by 𝜀)
Several important cost metrics (as function of 𝜀, 𝑛):
– Size of summary, time cost of each operation
8. Bloom Filters [Bloom 1970]
Bloom filters compactly encode set membership
– E.g. store a list of many long URLs compactly
– k random hash functions map each item to k positions in an m-bit vector
– Update: Set all 𝑘 entries to 1 to indicate item is present
– Query: Is 𝑥 in the set?
Return “yes” if all 𝑘 bits 𝑥 maps to are 1.
Duplicate insertions do not change Bloom filters
Can be merged by OR-ing vectors (of same size)
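To make the update, query, and merge rules above concrete, here is a minimal Bloom filter sketch in Python. It is illustrative only: the class and method names are ours, and k salted SHA-256 hashes stand in for the ideal random hash functions assumed in the analysis.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit vector and k salted hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1            # duplicate inserts change nothing

    def might_contain(self, item):
        # "yes" may be a false positive; "no" is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

    def merge(self, other):
        # Merge by OR-ing the bit vectors of filters with the same m, k.
        assert (self.m, self.k) == (other.m, other.k)
        for i in range(self.m):
            self.bits[i] |= other.bits[i]

bf = BloomFilter(m=1 << 16, k=7)
bf.add("https://example.com/a-very-long-url")
print(bf.might_contain("https://example.com/a-very-long-url"))  # True
print(bf.might_contain("https://example.com/unseen"))           # False (w.h.p.)
```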
9. Bloom Filters Analysis
If item is in the set, then always return “yes”
– No false negative
If item is not in the set, may return “yes” with some probability (a false positive)
– Prob. that a bit is 0: (1 − 1/m)^(kn)
– Prob. that all k bits are 1: (1 − (1 − 1/m)^(kn))^k ≈ exp(k ln(1 − e^(−kn/m)))
– Setting k = (m/n) ln 2 minimizes this probability to 0.6185^(m/n)
For false positive prob. δ, the Bloom filter needs O(n log(1/δ)) bits
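As a quick worked example of this sizing trade-off (under the same idealized-hash assumptions), the usual rules of thumb m ≈ n·ln(1/δ)/(ln 2)² and k ≈ (m/n)·ln 2 can be computed directly; the helper below and its name are ours.

```python
import math

def bloom_parameters(n, delta):
    """Approximate number of bits m and hash count k for n items and a
    target false-positive probability delta (assuming ideal hashes)."""
    m = math.ceil(n * math.log(1 / delta) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# For example, 1 million URLs at a 1% false-positive rate:
print(bloom_parameters(1_000_000, 0.01))  # about 9.6 million bits (~1.2 MB), k = 7
```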
10. Bloom Filters Applications
Bloom Filters widely used in “big data” applications
– Many problems require storing a large set of items
Can generalize to allow deletions
– Swap bits for counters: increment on insert, decrement on delete
– If no duplicates, small counters suffice: 4 bits per counter
Bloom Filters are still an active research area
– Often appear in networking conferences
– Also known as “signature files” in DB
11. Invertible Bloom Lookup Table [GM11]
A summary that
– Supports insertions and deletions
– When the number of items ≤ t, can retrieve all items
– Also known as sparse recovery
Easy case when 𝑡 = 1
– Assume all items are integers
– Just keep 𝑧 = the bitwise-XOR of all items
To insert 𝑥: 𝑧 ← 𝑧 ⊕ 𝑥
To delete 𝑥: 𝑧 ← 𝑧 ⊕ 𝑥
– Assumption: no duplicates; can’t delete a non-existing item
A problem: How do we know when # items = 1?
– Just keep another counter
12. Invertible Bloom Lookup Table
Structure is the same as a Bloom filter, but each cell
– is the XOR of all items mapped there
– also keeps a counter of such items
How to recover?
– First check whether any cell holds exactly 1 item
– Recover the item if there is one
– Delete the item from all its cells, and repeat
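The following toy Python sketch illustrates the structure just described: each cell stores a counter and the XOR of the integer items mapped to it, and recovery repeatedly peels cells that hold exactly one item. The cell count, salts, and names are our illustrative choices, not from [GM11].

```python
import hashlib

class IBLT:
    """Toy invertible Bloom lookup table over integer items.
    Each cell keeps a count and the XOR of all items mapped to it."""

    def __init__(self, cells, k=3):
        self.n, self.k = cells, k
        self.count = [0] * cells
        self.xor = [0] * cells

    def _cells(self, x):
        for salt in range(self.k):
            h = hashlib.sha256(f"{salt}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def _apply(self, x, delta):
        for c in self._cells(x):
            self.count[c] += delta
            self.xor[c] ^= x

    def insert(self, x): self._apply(x, +1)
    def delete(self, x): self._apply(x, -1)

    def recover(self):
        """Peel: find a cell with exactly one item, output that item,
        remove it from all its cells, and repeat (destructive)."""
        out, progress = [], True
        while progress:
            progress = False
            for c in range(self.n):
                if self.count[c] == 1:
                    x = self.xor[c]
                    out.append(x)
                    self.delete(x)
                    progress = True
        return out

t = IBLT(cells=12)
for x in [101, 202, 303, 404]:
    t.insert(x)
t.delete(202)
print(sorted(t.recover()))   # [101, 303, 404], with high probability
```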
13. Invertible Bloom Lookup Table
Can show that
– Using 1.5𝑡 cells, can recover ≤ 𝑡 items with high probability
14. Count-Min Sketch [CM 04]
Count-Min sketch encodes item counts
– Allows estimation of frequencies (e.g. for selectivity estimation)
– Some similarities in appearance to Bloom filters
Model input data as a vector 𝑥 of dimension 𝑈
– E.g., [0. . 𝑈] is the space of all IP addresses, 𝑥[𝑖] is the frequency
of IP address 𝑖 in the data
– Create a small summary as an array of size w × d
– Use d hash functions to map vector entries to [1..w]
(Diagram: the summary is an array CM[i, j] with d rows and w columns.)
15. Count-Min Sketch Structure
Update: each entry in vector 𝑥 is mapped to one bucket per row.
Merge two sketches by entry-wise summation
Query: estimate x[i] by taking min over k of CM[k, h_k(i)]
– Never underestimate
– Will bound the error of overestimation
(Diagram: an update (i, +c) adds c to one cell per row, at position h_k(i) in row k; there are d rows of width w = 2/ε.)
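A compact Python rendering of the structure above may help: d rows of w counters, one hash per row, point updates add c to one cell per row, and a query takes the row-wise minimum. The hash family (random affine functions modulo a Mersenne prime) and all names are our choices; items are assumed to be integer-keyed (e.g., IP addresses as integers).

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, one random affine hash per row."""

    PRIME = (1 << 61) - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]

    def _col(self, row, i):
        a, b = self.hashes[row]
        return ((a * i + b) % self.PRIME) % self.w

    def update(self, i, c=1):
        for r in range(self.d):
            self.table[r][self._col(r, i)] += c

    def query(self, i):
        # Never underestimates (for non-negative updates).
        return min(self.table[r][self._col(r, i)] for r in range(self.d))

    def merge(self, other):
        # Entry-wise sum of two sketches with identical parameters and hashes.
        for r in range(self.d):
            for c in range(self.w):
                self.table[r][c] += other.table[r][c]

eps = 0.01
cm = CountMinSketch(w=int(2 / eps), d=5)   # w = 2/eps, d grows like log(1/delta)
for ip in [17, 17, 17, 42, 99, 42, 17]:
    cm.update(ip)
print(cm.query(17))   # >= 4, and <= 4 + eps * (total count) with high probability
```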
16. Count-Min Sketch Analysis
Focusing on first row:
– x[i] is always added to CM[1, h1(i)]
– x[j], j ≠ i, is added to CM[1, h1(i)] with prob. 1/w
– The expected error from x[j] is x[j]/w
– The total expected error is Σ_{j≠i} x[j]/w ≤ ||x||1/w
– By Markov inequality, Pr[error > 2·||x||1/w] < 1/2
By taking the minimum of d rows, this prob. is (1/2)^d
To give an ε||x||1 error with prob. 1 − δ, the sketch needs to have size (2/ε) × log(1/δ)
17. Generalization: Sketch Structures
Sketch is a class of summary that is a linear transform of input
– Sketch(𝑥) = 𝑆𝑥 for some matrix 𝑆
– Hence, Sketch(𝑥 + 𝛽𝑦) = Sketch(𝑥) + 𝛽 Sketch(𝑦)
– Trivial to update and merge
Often describe 𝑆 in terms of hash functions
Analysis relies on properties of the hash functions
– Some analyses assume truly random hash functions (e.g., the Bloom filter)
These don't exist, but ad hoc hash functions work well on much real-world data
– Others only need hash functions with limited independence.
18. Count Sketch / AMS Sketch [CCF-C02, AMS96]
Second hash function: 𝑔 𝑘 maps each 𝑖 to +1 or −1
Update: each entry in vector 𝑥 is mapped to one bucket per row.
Merge two sketches by entry-wise summation
Query: estimate x[i] by taking median over k of C[k, h_k(i)] · g_k(i)
– May either overestimate or underestimate
(Diagram: an update (i, +c) adds c · g_k(i) to cell C[k, h_k(i)] in each of the d rows; width w.)
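The same skeleton with an extra sign hash gives a Count Sketch; the sketch below (our naming and hash choices, not a reference implementation of [CCF-C02, AMS96]) shows the signed update and the median-of-rows query.

```python
import random
import statistics

class CountSketch:
    """Count Sketch: like Count-Min, but each update is multiplied by a
    random sign g_k(i), and queries take the median over rows."""

    PRIME = (1 << 61) - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # One affine hash per row for the bucket, and one for the sign.
        self.bucket = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]
        self.sign = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                     for _ in range(d)]

    def _h(self, row, i):
        a, b = self.bucket[row]
        return ((a * i + b) % self.PRIME) % self.w

    def _g(self, row, i):
        a, b = self.sign[row]
        return 1 if ((a * i + b) % self.PRIME) % 2 == 0 else -1

    def update(self, i, c=1):
        for r in range(self.d):
            self.table[r][self._h(r, i)] += c * self._g(r, i)

    def query(self, i):
        # May over- or under-estimate; each row gives an unbiased estimate.
        return statistics.median(self.table[r][self._h(r, i)] * self._g(r, i)
                                 for r in range(self.d))

cs = CountSketch(w=400, d=5)
for item in [7] * 50 + [3] * 10 + list(range(1000)):
    cs.update(item)
print(cs.query(7))   # close to 51 (7 also appears once in range(1000))
```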
19. Count Sketch Analysis
Focusing on first row:
– x[i]·g1(i) is always added to C[1, h1(i)]
We return C[1, h1(i)]·g1(i), which contains x[i] exactly (since g1(i)² = 1)
– x[j]·g1(j), j ≠ i, is added to C[1, h1(i)] with prob. 1/w
– Its expected contribution to the error is x[j]·E[g1(i)·g1(j)]/w = 0
– E[total error] = 0, E[|total error|] ≤ ||x||2/√w
– By Chebyshev inequality, Pr[|total error| > 2·||x||2/√w] < 1/4
By taking the median of d rows, this prob. is (1/2)^O(d)
To give an ε||x||2 error with prob. 1 − δ, the sketch needs to have size O((1/ε²) log(1/δ))
20. Count-Min Sketch vs Count Sketch
Count-Min:
– Size: 𝑂 𝑤𝑑
– Error: ||𝑥||1/𝑤
Small Summaries for Big Data
20
Count Sketch:
– Size: 𝑂 𝑤𝑑
– Error: ||𝑥||2/ 𝑤
Other benefits:
– Unbiased estimator
– Can also be used to estimate
𝑥 ⋅ 𝑦 (join size)
Count Sketch is better when ||𝑥||2 < ||𝑥||1/ 𝑤
21. Application to Large Scale Machine Learning
In machine learning, often have very large feature space
– Many objects, each with huge, sparse feature vectors
– Slow and costly to work in the full feature space
“Hash kernels”: work with a sketch of the features
– Effective in practice! [Weinberger, Dasgupta, Langford, Smola, Attenberg ‘09]
Similar analysis explains why:
– Essentially, not too much noise on the important features
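A minimal illustration of the hash-kernel idea, under our own naming and with SHA-256 standing in for the hash functions: sparse named features are folded into a fixed-width vector with a bucket hash and a sign hash, so inner products between the hashed vectors approximate inner products in the original feature space.

```python
import hashlib

def hashed_features(features, width):
    """Map a sparse {feature_name: value} dict into a dense vector of
    fixed width, using one hash for the bucket and one for the sign."""
    v = [0.0] * width
    for name, value in features.items():
        h = int(hashlib.sha256(name.encode()).hexdigest(), 16)
        bucket = h % width
        sign = 1.0 if (h >> 64) % 2 == 0 else -1.0
        v[bucket] += sign * value
    return v

doc_a = {"word:sketch": 3.0, "word:data": 2.0, "word:summary": 1.0}
doc_b = {"word:sketch": 1.0, "word:stream": 4.0}
a, b = hashed_features(doc_a, 1 << 12), hashed_features(doc_b, 1 << 12)
print(sum(x * y for x, y in zip(a, b)))   # ~ 3.0, the true inner product
```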
23. Heavy hitters [MG82]
Misra-Gries (MG) algorithm finds up to 𝑘 items that occur
more than 1/𝑘 fraction of the time in a stream
– Estimate their frequencies with additive error ≤ 𝑁/𝑘
– Equivalently, achieves 𝜀𝑁 error with 𝑂(1/𝜀) space
Better than Count-Min but doesn’t support deletions
Keep 𝑘 different candidates in hand. To add a new item:
– If item is monitored, increase its counter
– Else, if < 𝑘 items monitored, add new item with count 1
– Else, decrease all counts by 1
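The three update rules translate directly into a few lines of Python; this is a minimal sketch of the MG algorithm (the function name and test stream are ours).

```python
def misra_gries(stream, k):
    """Misra-Gries heavy hitters: keep at most k counters.
    Estimated counts are lower bounds, off by at most N/(k+1)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1                 # monitored: increment
        elif len(counters) < k:
            counters[x] = 1                  # room left: start monitoring
        else:
            for y in list(counters):         # full: decrement everything
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

stream = ["a"] * 40 + ["b"] * 30 + list("cdefghij") * 3
print(misra_gries(stream, k=5))   # "a" and "b" survive with large counts
```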
26. MG analysis
𝑁 = total input size
𝑀 = sum of counters in data structure
Error in any estimated count at most (𝑁 − 𝑀)/(𝑘 + 1)
– Estimated count a lower bound on true count
– Each decrement spread over (k + 1) items: 1 new one and k in MG
– Equivalent to deleting (𝑘 + 1) distinct items from stream
– At most (𝑁 − 𝑀)/(𝑘 + 1) decrement operations
– Hence, can have “deleted” (𝑁 − 𝑀)/(𝑘 + 1) copies of any item
– So estimated counts have at most this much error
27. Merging two MG summaries [ACHPZY12]
Merging algorithm:
– Merge two sets of 𝑘 counters in the obvious way
– Take the (k + 1)-th largest counter C_{k+1}, and subtract it from all counters
– Delete non-positive counters
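A small Python sketch of this merge rule, assuming each summary is stored as a dict from item to counter (our representation, not the [ACHPZY12] code):

```python
def merge_mg(c1, c2, k):
    """Merge two Misra-Gries summaries (dicts of item -> counter), each
    using at most k counters, following the rule described above."""
    merged = dict(c1)
    for item, count in c2.items():
        merged[item] = merged.get(item, 0) + count   # add counters item-wise
    # The (k+1)-th largest counter (0 if there are at most k counters).
    counts = sorted(merged.values(), reverse=True)
    c_k1 = counts[k] if len(counts) > k else 0
    # Subtract it from every counter and drop non-positive ones.
    return {item: c - c_k1 for item, c in merged.items() if c - c_k1 > 0}

s1 = {"a": 40, "b": 12, "c": 3}
s2 = {"a": 25, "d": 9, "e": 2}
print(merge_mg(s1, s2, k=3))   # {'a': 62, 'b': 9, 'd': 6}: at most k counters
```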
28. Merging two MG summaries
This algorithm gives mergeability:
– Merge subtracts at least (k + 1)·C_{k+1} from the counter sums
– So (k + 1)·C_{k+1} ≤ M1 + M2 − M12
Sum of remaining (at most k) counters is M12
– By induction, error is
((N1 − M1) + (N2 − M2) + (M1 + M2 − M12)) / (k + 1)   [prior error + error from the merge]
= ((N1 + N2) − M12) / (k + 1)
29. Quantiles (order statistics)
Exact quantiles: F⁻¹(φ) for 0 < φ < 1, where F is the CDF
Approximate version: tolerate any answer between F⁻¹(φ − ε) and F⁻¹(φ + ε)
Dual problem - rank estimation: Given 𝑥, estimate 𝐹(𝑥)
– Can use binary search to find quantiles
30. Quantiles give an equi-height histogram
Automatically adapts to skewed data distributions
Equi-width histograms (fixed binning) are easy to construct but do not adapt to the data distribution
31. The GK Quantile Summary [GK01]
(Diagram from the GK animation: each node stores a value and a rank; shown after the first incoming item, 3, arrives, with N = 1.)
42. The GK Quantile Summary
The practical version
– Space: Open problem!
– Small in practice
A complicated version achieves O((1/ε) log(εN)) space
Doesn’t support deletions
Doesn’t (don’t know how to) support merging
There is a randomized mergeable quantile summary
43. A Quantile Summary Supporting Deletions [WLYC13]
Assumption: all elements are integers from [0..U]
Use a dyadic decomposition
(Diagram: dyadic decomposition of [0..U]; e.g., one node counts the number of elements between U/2 and U − 1.)
45. A Quantile Summary Supporting Deletions
Consider each layer as a frequency vector, and build a sketch
– Count Sketch is better than Count-Min due to unbiasedness
Space: O((1/ε) · log^1.5 U)
(Diagram: one Count Sketch (CS) per level of the dyadic decomposition.)
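To illustrate the dyadic decomposition, the toy code below keeps exact per-level counts and answers rank queries by summing O(log U) dyadic blocks; in the summary described above, each level's exact counts would be replaced by a Count Sketch of that level's frequency vector. The names and structure are our illustrative choices, not the [WLYC13] implementation.

```python
class DyadicCounts:
    """Dyadic decomposition over the universe [0 .. 2**log_u - 1].
    Each level keeps exact counts here; replacing each level's dict with a
    Count Sketch gives the small-space summary described above."""

    def __init__(self, log_u):
        self.log_u = log_u
        self.levels = [dict() for _ in range(log_u + 1)]

    def update(self, x, c=1):
        # At level i, the universe splits into blocks of length 2**i;
        # x falls into block x >> i. Deletions are negative updates.
        for i in range(self.log_u + 1):
            block = x >> i
            self.levels[i][block] = self.levels[i].get(block, 0) + c

    def rank(self, x):
        """Number of elements < x, written as a sum of O(log U) dyadic blocks."""
        total = 0
        for i in range(self.log_u):
            if (x >> i) & 1:
                block = (x >> (i + 1)) * 2      # the dyadic block at level i
                total += self.levels[i].get(block, 0)
        return total

d = DyadicCounts(log_u=8)                       # universe [0..255]
for v in [3, 3, 7, 100, 200, 250]:
    d.update(v)
d.update(100, -1)                               # a deletion
print(d.rank(101))                              # 3: the elements 3, 3, 7 are below 101
```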
47. Random Sample as a Summary
Cost
– A random sample is cheaper to build if you have random access
to data
Error-size tradeoff
– Specialized summaries often have size 𝑂(1/𝜀)
– A random sample needs size O(1/ε²)
Generality
– Specialized summaries can only answer one type of queries
– Random sample can be used to answer different queries
In particular, ad hoc queries that are unknown beforehand
48. Sampling Definitions
Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
Coin-flip sampling
– Sample each element with probability 𝑝
– Sample size changes for dynamic data
The statistical difference is
very small, for 𝑁 ≫ 𝑠 ≫ 1
49. Reservoir Sampling
Given a sample of size s drawn (without replacement) from a data set of size N
A new element is added; how do we update the sample?
Algorithm
– 𝑁 ← 𝑁 + 1
– With probability 𝑠/𝑁, use it to replace an item in the
current sample chosen uniformly at random
– With probability 1 − 𝑠/𝑁, throw it away
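A minimal Python sketch of this update rule (our naming), which maintains a uniform without-replacement sample of size s over a stream of unknown length:

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Keep a uniform without-replacement sample of size s from a stream."""
    sample, n = [], 0
    for x in stream:
        n += 1
        if len(sample) < s:
            sample.append(x)                  # the first s items fill the reservoir
        elif rng.random() < s / n:
            sample[rng.randrange(s)] = x      # replace a uniformly chosen slot
        # else: throw x away
    return sample

print(reservoir_sample(range(10_000), s=5))
```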
50. Reservoir Sampling Correctness Proof
Many “proofs” found online are actually wrong
– They only show that each item is sampled with probability 𝑠/𝑁
– Need to show that every subset of size 𝑠 has the same
probability to be the sample
A correct proof relates reservoir sampling to the Fisher-Yates shuffle
(Diagram: relating a reservoir sample of size s = 2 over items a, b, c, d to the permutations produced by a Fisher-Yates shuffle.)
56. Random Sampling with Deletions
Naïve method:
– Maintain a sample of size 𝑀 > 𝑠
– Handle insertions as in reservoir sampling
– If the deleted item is not in the sample, ignore it
– If the deleted item is in the sample, delete it from the sample
– Problem: the sample size may drop to below 𝑠
Some other algorithms in the DB literature but sample size
still drops when there are many deletions
57. Min-wise Hashing / Sampling
For each item 𝑥, compute ℎ 𝑥
Return the 𝑠 items with the smallest values of ℎ(𝑥) as the
sample
– Assumption: ℎ is a truly random hash function
– A lot of research on how to remove this assumption
58. 𝑳 𝟎-Sampling [FIS05, CMR05]
Suppose ℎ 𝑥 ∈ [0. . 𝑢 − 1] and always has log 𝑢 bits (pad
zeroes when necessary)
Map all items to level 0
Map items that have one leading zero in ℎ(𝑥) to level 1
Map items that have two leading zeroes in ℎ(𝑥) to level 2
…
Use a 2𝑠 -sparse recovery summary for each level
(Diagram: items map to levels with probability from p = 1 at level 0 down to p = 1/u at the deepest level; each level keeps a 2s-sparse recovery summary.)
59. Estimating Set Similarity with MinHash
Assumption: The sample from 𝐴 and the sample from 𝐵 are
obtained by the same ℎ
Pr[(an item sampled uniformly at random from A ∪ B) ∈ A ∩ B] = J(A, B)
To estimate 𝐽(𝐴, 𝐵):
– Merge Sample(𝐴) and Sample(𝐵) to get Sample(𝐴 ∪ 𝐵) of size 𝑠
– Count how many items in Sample(𝐴 ∪ 𝐵) are in 𝐴 ∩ 𝐵
– Divide by 𝑠
J(A, B) = |A ∩ B| / |A ∪ B|
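A small Python sketch of this estimator, assuming both samples were built with the same hash h (here SHA-256; all names are ours):

```python
import hashlib

def _h(x):
    return hashlib.sha256(str(x).encode()).hexdigest()   # the shared hash h

def minhash_sample(items, s):
    """The s items of the set with the smallest values of h(x)."""
    return set(sorted(items, key=_h)[:s])

def estimate_jaccard(sample_a, sample_b, s):
    """Estimate J(A, B) from two min-wise samples built with the same h."""
    union_sample = minhash_sample(sample_a | sample_b, s)   # = Sample(A ∪ B)
    # An item of Sample(A ∪ B) lies in A ∩ B iff it appears in both samples.
    hits = sum(1 for x in union_sample if x in sample_a and x in sample_b)
    return hits / s

A, B, s = set(range(0, 800)), set(range(400, 1200)), 200
print(estimate_jaccard(minhash_sample(A, s), minhash_sample(B, s), s))
# close to 400 / 1200 ~ 0.33, the true Jaccard similarity
```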
61. Geometric data: ε-approximations
A sample that preserves the "density" of point sets
– For any range (e.g., a circle),
|fraction of sample points − fraction of all points| ≤ ε
– Can simply draw a random sample of size O(1/ε²)
– More careful construction yields size O(1/ε^(2d/(d+1)))
62. Geometric data: ε-kernels
ε-kernels approximately preserve the convex hull of points
– An ε-kernel has size O(1/ε^((d−1)/2))
– A streaming ε-kernel has size O((1/ε^((d−1)/2)) log(1/ε))
63. Graph Sketching [AGM12]
Goal: Build a sketch for each vertex and use it to answer queries
Connectivity: want to test if there is a path between any two nodes
– Trivial if edges can only be inserted; want to support deletions
Basic idea: repeatedly contract edges between components
– Use L0 sampling to get edges from vector of adjacencies
– The L0 sampling sketch supports deletions and merges
Problem: as components grow, sampling edges from a component is most likely to produce internal edges
64. Graph Sketching
Idea: use clever encoding of edges
Encode edge (i,j), i < j, as ((i,j),+1) in node i's vector and as ((i,j),-1) in node j's vector
When node i and node j get merged, sum their L0 sketches
– Contribution of edge (i,j) exactly cancels out
– Only non-internal edges remain in the L0 sketches
Use independent sketches for each iteration of the algorithm
– Only need O(log n) rounds with high probability
Result: O(poly-log n) space per node for connectivity
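A tiny numeric illustration of this cancellation, using exact signed edge-incidence vectors instead of L0 sketches (our simplification; since an L0 sketch is linear, the same cancellation carries over to the sketches):

```python
from itertools import combinations

nodes = [0, 1, 2, 3]
edges = {(0, 1), (1, 2), (2, 3)}                 # a path 0-1-2-3
pairs = list(combinations(nodes, 2))             # coordinates of each vector

def incidence(v):
    """Signed edge-incidence vector of node v: +1 if v is the smaller
    endpoint of an edge, -1 if it is the larger endpoint, 0 otherwise."""
    vec = []
    for (i, j) in pairs:
        if (i, j) in edges and v == i:
            vec.append(+1)
        elif (i, j) in edges and v == j:
            vec.append(-1)
        else:
            vec.append(0)
    return vec

# Merge nodes 1 and 2 by summing their vectors: the internal edge (1,2)
# cancels out, while the boundary edges (0,1) and (2,3) remain visible.
merged = [a + b for a, b in zip(incidence(1), incidence(2))]
print(dict(zip(pairs, merged)))
# {(0,1): -1, (0,2): 0, (0,3): 0, (1,2): 0, (1,3): 0, (2,3): 1}
```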
65. Other Graph Results via sketching
Recent flurry of activity in summaries for graph problems
– K-connectivity via connectivity
– Bipartiteness via connectivity:
– (Weight of the) Minimum spanning tree:
– Sparsification: find G' with few edges so that cut(G,C) ≈ cut(G',C) for every cut C
– Matching: find a maximal matching (assuming it is small)
Total cost is typically O(|V|), rather than O(|E|)
– Semi-streaming / semi-external model
66. Matrix Sketching
Given matrices A, B, want to approximate matrix product AB
– Measure the normed error of approximation C: ‖AB − C‖
Main results for the Frobenius (entrywise) norm ‖·‖_F
– ‖C‖_F = (Σ_{i,j} C_{i,j}²)^(1/2)
– Results rely on sketches, so this entrywise norm is most natural
67. Direct Application of Sketches
Build AMS sketch of each row of A (Ai), each column of B (Bj)
Estimate Ci,j by estimating inner product of Ai with Bj
– Absolute error in each estimate is ε‖A_i‖₂ ‖B_j‖₂ (whp)
– Summed over all entries in the matrix, the squared error is ε²‖A‖_F² ‖B‖_F²
Outline formalized & improved by Clarkson & Woodruff [09,13]
– Improve running time to linear in number of non-zeros in A,B
68. Compressed Matrix Multiplication
What if we are just interested in the large entries of AB?
– Or, the ability to estimate any entry of (AB)
– Arises in recommender systems, other ML applications
If we had a sketch of (AB), could find these approximately
Compressed Matrix Multiplication [Pagh 12]:
– Can we compute sketch(AB) from sketch(A) and sketch(B)?
– To do this, need to dive into structure of the Count (AMS) sketch
Several insights needed to build the method:
– Express matrix product as summation of outer products
– Take convolution of sketches to get a sketch of outer product
– New hash function enables this to proceed
– Use the FFT to speed up from O(w²) to O(w log w)
69. More Linear Algebra
Matrix multiplication improvement: use more powerful hash fns
– Obtain a single accurate estimate with high probability
Linear regression given matrix A and vector b:
find x ∈ R^d to (approximately) solve min_x ‖Ax − b‖
– Approach: solve the minimization in "sketch space"
– From a summary of size O(d²/ε) [independent of the number of rows of A]
Frequent directions: approximate matrix-vector product
[Ghashami, Liberty, Phillips, Woodruff 15]
– Use the SVD to (incrementally) summarize matrices
The relevant sketches can be built quickly: proportional to the
number of nonzeros in the matrices (input sparsity)
– Survey: Sketching as a tool for linear algebra [Woodruff 14]
70. Current Directions in Data Summarization
Sparse representations of high dimensional objects
– Compressed sensing, sparse fast Fourier transform
General purpose numerical linear algebra for (large) matrices
– k-rank approximation, linear regression, PCA, SVD, eigenvalues
Summaries to verify full calculation: a ‘checksum for computation’
Geometric (big) data: coresets, clustering, machine learning
Use of summaries in large-scale, distributed computation
– Build them in MapReduce, Continuous Distributed models
Communication-efficient maintenance of summaries
– As the (distributed) input is modified
71. Two complementary approaches in response to growing data sizes
– Scale the computation up; scale the data down
The theory and practice of data summarization has many guises
– Sampling theory (since the start of statistics)
– Streaming algorithms in computer science
– Compressive sampling, dimensionality reduction… (maths, stats, CS)
Continuing interest in applying and developing new theory
Small Summary of Small Summaries