HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm

Stefan Heule (ETH Zurich and Google, Inc.), stheule@ethz.ch
Marc Nunkesser (Google, Inc.), marcnunkesser@google.com
Alexander Hall (Google, Inc.), alexhall@google.com
ABSTRACT
Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed in the past, and the HyperLogLog algorithm is one of them. In this paper, we present a series of improvements to this algorithm that reduce its memory requirements and significantly increase its accuracy for an important range of cardinalities. We have implemented our proposed algorithm for a system at Google and evaluated it empirically, comparing it to the original HyperLogLog algorithm. Like HyperLogLog, our improved algorithm parallelizes perfectly and computes the cardinality estimate in a single pass.

1. INTRODUCTION
Cardinality estimation is the task of determining the number of distinct elements in a data stream. While the cardinality can be easily computed using space linear in the cardinality, for many applications this is completely impractical and requires too much memory. Therefore, many algorithms that approximate the cardinality while using less resources have been developed. These algorithms play an important role in network monitoring systems, data mining applications, as well as database systems.

Starting with work by Flajolet and Martin [6], a lot of research has been devoted to the cardinality estimation problem. See [3, 14] for an overview and a categorization of available algorithms. There are two popular models to describe and analyse cardinality estimation algorithms: Firstly, the data streaming (ε, δ)-model [10, 11] analyses the necessary space to get a (1 ± ε)-approximation with a fixed success probability of δ, for example δ = 2/3, for cardinalities in {1, . . . , n}. Secondly, it is possible to analyse the relative accuracy, defined as the standard error of the estimator [3].

In this paper we present a series of improvements to the HyperLogLog algorithm by Flajolet et al. [7] that estimates cardinalities efficiently. Our improvements decrease memory usage as well as increase the accuracy of the estimate significantly for a range of important cardinalities. We evaluate all improvements empirically and compare with the HyperLogLog algorithm from [7]. Our changes to the algorithm are generally applicable and not specific to our system. Like HyperLogLog, our proposed improved algorithm parallelizes perfectly and computes the cardinality estimate in a single pass.

Outline. The remainder of this paper is organized as follows: we first justify our algorithm choice and summarize related work in Section 2. In Section 3 we give background information on our practical use cases at Google and list the requirements for a cardinality estimation algorithm in this context. In Section 4 we present the HyperLogLog algorithm from [7] that is the basis of our improved algorithm. In Section 5 we describe the improvements we made to the algorithm, as well as evaluate each of them empirically. Section 6 explains advantages of HyperLogLog for dictionary encodings in column stores. Finally, we conclude in Section 7.

2. RELATED WORK AND ALGORITHM CHOICE
At Google, various data analysis systems such as Sawzall [15], Dremel [13] and PowerDrill [9] estimate the cardinality of very large data sets every day, for example to determine the number of distinct search queries on google.com over a time period. Such queries represent a hard challenge in terms of computational resources, and memory in particular: For the PowerDrill system, a non-negligible fraction of queries historically could not be computed because they exceeded the available memory.

The algorithm previously implemented at Google [13, 15, 9] was MinCount, which is presented as algorithm one in [2]. It has an accuracy of 1.0/√m, where m is the maximum number of hash values maintained (and thus linear in the required memory). In the (ε, δ)-model, this algorithm needs O(ε^{−2} log n) space, and it is near exact for cardinalities up to m (modulo hash collisions). See [8] for the statistical analysis and suggestions for making the algorithm more computationally efficient.

The algorithm presented by Kane et al. [11] meets the lower bound on space of Ω(ε^{−2} + log n) in the (ε, δ)-model [10] and is optimal in that sense. However, the algorithm is complex and an actual implementation and its maintenance seems out of reach in a practical system.
In [1], the authors compare six cardinality estimation algorithms, including MinCount and LogLog [4] combined with LinearCounting [16] for small cardinalities. The latter algorithm comes out as the winner in that comparison. In [14] the authors compare 12 algorithms analytically and the most promising 8 experimentally on a single data set with 1.9 · 10^6 distinct elements, among them LogLog, LinearCounting and MultiresolutionBitmap [5], a multiscale version of LinearCounting. LinearCounting is recommended as the algorithm of choice for the tested cardinality. LogLog is shown to have a better accuracy than all other algorithms except LinearCounting and MultiresolutionBitmap on their input data. We are interested in estimating multisets of much larger cardinalities well beyond 10^9, for which LinearCounting is no longer attractive, as it requires too much memory for an accurate estimate. MultiresolutionBitmap has similar problems and needs O(ε^{−2} log n) space in the (ε, δ)-model, which is growing faster than the memory usage of LogLog. The authors of the study also had problems to run MultiresolutionBitmap with a fixed given amount of memory.

HyperLogLog has been proposed by Flajolet et al. [7] and is an improvement of LogLog. It has been published after the afore-mentioned studies. Its relative error is 1.04/√m and it needs O(ε^{−2} log log n + log n) space in the (ε, δ)-model, where m is the number of counters (usually less than one byte in size). HyperLogLog is shown to be near optimal among algorithms that are based on order statistics. Its theoretical properties certify that it has a superior accuracy for a given fixed amount of memory over MinCount and many other practical algorithms. The fact that LogLog (with LinearCounting for small cardinalities), on which HyperLogLog improves, performed so well in the previous experimental studies confirms our choice.

In [12], Lumbroso analyses an algorithm similar to HyperLogLog that uses the inverse of an arithmetic mean instead of the harmonic mean as evaluation function. Similar to our empirical bias correction described in Section 5.2, he performs a bias correction for his estimator that is based on a full mathematical bias analysis of its "intermediate regime".

3. PRACTICAL REQUIREMENTS FOR CARDINALITY ESTIMATION
In this section we present requirements for a cardinality estimation algorithm to be used in PowerDrill [9]. While we were driven by our particular needs, many of these requirements are more general and apply equally to other applications.

PowerDrill is a column-oriented datastore as well as an interactive graphical user interface that sends SQL-like queries to its backends. The column store uses highly memory-optimized data structures to allow low latency queries over datasets with hundreds of billions of rows. The system heavily relies on in-memory caching and to a lesser degree on the type of queries produced by the frontend. Typical queries group by one or more data fields and filter by various criteria.

As in most database systems, a user can count the number of distinct elements in a data set by issuing a count distinct query. In many cases, such a query will be grouped by a field (e.g., country, or the minute of the day) and counts the distinct elements of another field (e.g., query-text of Google searches). For this reason, a single query can lead to many count distinct computations being carried out in parallel.

On an average day, PowerDrill performs about 5 million such count distinct computations. As many as 99% of these computations yield a result of 100 or less. This can be explained partly by the fact that some groups in a group-by query can have only few values, which translates necessarily into a small cardinality. On the other extreme, about 100 computations a day yield a result greater than 1 billion.

As to the precision, while most users will be happy with a good estimate, there are some critical use cases where a very high accuracy is required. In particular, our users value the property of the previously implemented MinCount algorithm [2] that it can provide near exact results up to a threshold and becomes approximate beyond that threshold.

Therefore, the key requirements for a cardinality estimation algorithm can be summarized as follows:

• Accuracy. For a fixed amount of memory, the algorithm should provide as accurate an estimate as possible. Especially for small cardinalities, the results should be near exact.

• Memory efficiency. The algorithm should use the available memory efficiently and adapt its memory usage to the cardinality. That is, the algorithm should use less than the user-specified maximum amount of memory if the cardinality to be estimated is very small.

• Estimate large cardinalities. Multisets with cardinalities well beyond 1 billion occur on a daily basis, and it is important that such large cardinalities can be estimated with reasonable accuracy.

• Practicality. The algorithm should be implementable and maintainable.

4. THE HYPERLOGLOG ALGORITHM
The HyperLogLog algorithm uses randomization to approximate the cardinality of a multiset. This randomization is achieved by using a hash function h that is applied to every element that is to be counted. The algorithm observes the maximum number of leading zeros that occur for all hash values, where intuitively hash values with more leading zeros are less likely and indicate a larger cardinality. If the bit pattern 0^{ϱ−1}1 is observed at the beginning of a hash value, then a good estimation of the size of the multiset is 2^ϱ (assuming the hash function produces uniform hash values).

To reduce the large variability that such a single measurement has, a technique known as stochastic averaging [6] is used. To that end, the input stream of data elements S is divided into m substreams S_i of roughly equal size, using the first p bits of the hash values, where m = 2^p. In each substream, the maximum number of leading zeros (after the initial p bits that are used to determine the substream) is measured independently.
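To make the bit manipulation concrete, here is a minimal Python sketch (ours, not from the paper) of how a single hash value is split into the p-bit substream index and the ϱ value of the remaining bits:

def rho(w, bits):
    # Number of leading zeros in the bits-wide value w, plus one.
    return bits - w.bit_length() + 1

def split_hash(x, p, hash_bits=32):
    idx = x >> (hash_bits - p)             # first p bits pick the substream
    w = x & ((1 << (hash_bits - p)) - 1)   # remaining hash bits
    return idx, rho(w, hash_bits - p)

For example, with p = 4 (so m = 16 substreams) and a 32 bit hash value whose first four bits are 0011, the element goes to substream 3, and the ϱ value of the remaining 28 bits is used to update that substream's register.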
These numbers are kept in an array of registers M, where M[i] stores the maximum number of leading zeros plus one for the substream with index i. That is,

M[i] = max_{x ∈ S_i} ϱ(x)

where ϱ(x) denotes the number of leading zeros in the binary representation of x plus one. Note that by convention max_{x ∈ ∅} ϱ(x) = −∞. Given these registers, the algorithm then computes the cardinality estimate as the normalized bias corrected harmonic mean of the estimations on the substreams as

E := α_m · m^2 · ( Σ_{j=1}^{m} 2^{−M[j]} )^{−1}

where

α_m := ( m ∫_0^∞ ( log_2( (2 + u)/(1 + u) ) )^m du )^{−1}

Full details on the algorithm, as well as an analysis of its properties, can be found in [7]. In a practical setting, however, this algorithm has a series of problems, which Flajolet et al. address by presenting a practical variant of the algorithm. This second algorithm uses 32 bit hash values with the precision argument p in the range [4..16]. The following modifications are applied to the algorithm. For a full explanation of these changes, see [7].

1. Initialization of registers. The registers are initialized to 0 instead of −∞ to avoid the result 0 for n ≪ m log m, where n is the cardinality of the data stream (i.e., the value we are trying to estimate).

2. Small range correction. Simulations by Flajolet et al. show that for n < (5/2)m nonlinear distortions appear that need to be corrected. Thus, for this range LinearCounting [16] is used.

3. Large range corrections. When n starts to approach 2^32 ≈ 4 · 10^9, hash collisions become more and more likely (due to the 32 bit hash function). To account for this, a correction is used.

The full practical algorithm is shown in Figure 1. In the remainder of this paper, we refer to it as HllOrig.

Require: Let h : D → {0, 1}^32 hash data from domain D. Let m = 2^p with p ∈ [4..16].
Phase 0: Initialization.
 1: Define α_16 = 0.673, α_32 = 0.697, α_64 = 0.709,
 2:   α_m = 0.7213/(1 + 1.079/m) for m ≥ 128.
 3: Initialize m registers M[0] to M[m − 1] to 0.
Phase 1: Aggregation.
 4: for all v ∈ S do
 5:   x := h(v)
 6:   idx := ⟨x_31, . . . , x_{32−p}⟩_2  { First p bits of x }
 7:   w := ⟨x_{31−p}, . . . , x_0⟩_2
 8:   M[idx] := max{M[idx], ϱ(w)}
 9: end for
Phase 2: Result computation.
10: E := α_m · m^2 · ( Σ_{j=0}^{m−1} 2^{−M[j]} )^{−1}  { The "raw" estimate }
11: if E ≤ (5/2)m then
12:   Let V be the number of registers equal to 0.
13:   if V ≠ 0 then
14:     E* := LinearCounting(m, V)
15:   else
16:     E* := E
17:   end if
18: else if E ≤ (1/30) · 2^32 then
19:   E* := E
20: else
21:   E* := −2^32 log(1 − E/2^32)
22: end if
23: return E*

Define LinearCounting(m, V)
  Returns the LinearCounting cardinality estimate.
24: return m log(m/V)

Figure 1: The practical variant of the HyperLogLog algorithm as presented in [7]. We use LSB 0 bit numbering.

5. IMPROVEMENTS TO HYPERLOGLOG
In this section we propose a number of improvements to the HyperLogLog algorithm. The improvements are presented as a series of individual changes, and we assume for every step that all previously presented improvements are kept. We call the final algorithm HyperLogLog++ and show its pseudo-code in Figure 6.

5.1 Using a 64 Bit Hash Function
An algorithm that only uses the hash code of the input values is limited by the number of bits of the hash codes when it comes to accurately estimating large cardinalities. In particular, a hash function of L bits can at most distinguish 2^L different values, and as the cardinality n approaches 2^L, hash collisions become more and more likely and accurate estimation gets impossible.

A useful property of the HyperLogLog algorithm is that the memory requirement does not grow linearly with L, unlike other algorithms such as MinCount or LinearCounting. Instead, the memory requirement is determined by the number of registers and the maximum size of ϱ(w) (which is stored in the registers). For a hash function of L bits and a precision p, this maximum value is L + 1 − p. Thus, the memory required for the registers is ⌈log_2(L + 1 − p)⌉ · 2^p bits. The algorithm HllOrig uses 32 bit hash codes, which requires 5 · 2^p bits.

To fulfill the requirement of being able to estimate multisets of cardinalities beyond 1 billion, we use a 64 bit hash function. This increases the size of a single register by only a single bit, leading to a total memory requirement of 6 · 2^p bits. Only if the cardinality approaches 2^64 ≈ 1.8 · 10^19 do hash collisions become a problem; we have not needed to estimate inputs with a size close to this value so far.
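To illustrate, the following is a compact Python sketch (ours, not the authors' implementation) of the practical variant of Figure 1, with the hash width as a parameter so that the same code also covers the 64 bit variant discussed above; MD5 merely stands in for any well-behaved hash function:

import hashlib
import math

def hll_estimate(stream, p=14, hash_bits=64):
    # Sketch of the practical HyperLogLog variant (Figure 1); hash_bits=32
    # corresponds to HllOrig, hash_bits=64 to the 64 bit variant.
    m = 1 << p
    alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(m, 0.7213 / (1 + 1.079 / m))
    M = [0] * m
    for v in stream:
        x = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:hash_bits // 8], "big")
        idx = x >> (hash_bits - p)                  # first p bits of x
        w = x & ((1 << (hash_bits - p)) - 1)        # remaining bits
        M[idx] = max(M[idx], (hash_bits - p) - w.bit_length() + 1)
    E = alpha * m * m / sum(2.0 ** -r for r in M)   # the "raw" estimate
    if E <= 2.5 * m:                                # small range correction
        V = M.count(0)
        return m * math.log(m / V) if V != 0 else E
    if hash_bits == 32 and E > 2**32 / 30:          # large range correction
        return -(2**32) * math.log(1.0 - E / 2**32)
    return E

For p = 14, hll_estimate(range(100000)) should return roughly 100000; the expected relative error is about 1.04/√m ≈ 0.8%.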
With this change, the large range correction for cardinalities close to 2^32 used in HllOrig is no longer needed. It would be possible to introduce a similar correction if the cardinality approaches 2^64, but it seems unlikely that such cardinalities are encountered in practice. If such cardinalities occur, however, it might make more sense to increase the number of bits for the hash function further, especially given the low additional cost in memory.

5.2 Estimating Small Cardinalities
The raw estimate of HllOrig (cf. Figure 1, line 10) has a large error for small cardinalities. For instance, for n = 0 the algorithm always returns roughly 0.7m [7]. To achieve better estimates for small cardinalities, HllOrig uses LinearCounting [16] below a threshold of (5/2)m and the raw estimate above that threshold.

In simulations, we noticed that most of the error of the raw estimate is due to bias; the algorithm overestimates the real cardinality for small sets. The bias is larger for smaller n, e.g., for n = 0 we already mentioned that the bias is about 0.7m. The statistical variability of the estimate, however, is small compared to the bias. Therefore, if we can correct for the bias, we can hope to get a better estimate, in particular for small cardinalities.

Experimental Setup. To measure the bias, we ran a version of Hll64Bit that does not use LinearCounting and measured the estimate for a range of different cardinalities. The HyperLogLog algorithm uses a hash function to randomize the input data, and will thus, for a fixed hash function and input, return the same results. To get reliable data we ran each experiment for a fixed cardinality and precision on 5000 different randomly generated data sets of that cardinality. Intuitively, the distribution of the input set should be irrelevant as long as the hash function ensures an appropriate randomization. We were able to convince ourselves of this by considering various data generation strategies that produced differently distributed data and ensuring that our results were comparable. We use this approach of computing results on randomly generated datasets of the given cardinality for all experiments in this paper.

Note that the experiments need to be repeated for every possible precision. For brevity, and since the results are qualitatively similar, we illustrate the behavior of the algorithm by considering only precision 14 here and in the remainder of the paper.

We use the same proprietary 64 bit hash function for all experiments. We have tested the algorithm with a variety of hash functions including MD5, Sha1, Sha256, Murmur3, as well as several proprietary hash functions. However, in our experiments we were not able to find any evidence that any of these hash functions performed significantly better than others.

Empirical Bias Correction. To determine the bias, we calculate the mean of all raw estimates for a given cardinality minus that cardinality. In Figure 2 we show the average raw estimate with 1% and 99% quantiles. We also show the x = y line, which would be the expected value for an unbiased estimator. Note that only if the bias accounts for a significant fraction of the overall error can we expect a reduced error by correcting for the bias. Our experiments show that at the latest for n > 5m the correction no longer reduces the error significantly.

Figure 2: The average raw estimate of the Hll64Bit algorithm to illustrate the bias of this estimator for p = 14, as well as the 1% and 99% quantiles on 5000 randomly generated data sets per cardinality. Note that the quantiles and the median almost coincide for small cardinalities; the bias clearly dominates the variability in this range.

With this data, for any given cardinality we can compute the observed bias and use it to correct the raw estimate. As the algorithm does not know the cardinality, we record for every cardinality the raw estimate as well as the bias, so that the algorithm can use the raw estimate to look up the corresponding bias correction. To make this practical, we choose 200 cardinalities as interpolation points, for which we record the average raw estimate and bias. We use k-nearest neighbor interpolation to get the bias for a given raw estimate (for k = 6)¹. In the pseudo-code in Figure 6 we use the procedure EstimateBias that performs the k-nearest neighbor interpolation.

¹ The choice of k = 6 is rather arbitrary. The best value of k could be determined experimentally, but we found that the choice has only a minuscule influence.
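The interpolation itself is simple; the following Python sketch (ours) shows one possible implementation of EstimateBias, where rawEstimateData and biasData stand in for the 200 empirically determined interpolation points of a given precision (the values below are fabricated for illustration only):

import bisect

rawEstimateData = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0]
biasData        = [ 900.0,  800.0,  700.0,  600.0,  500.0,  400.0,  300.0]

def estimate_bias(E, k=6):
    # Average the biases of the k interpolation points whose recorded raw
    # estimate is closest to E (k-nearest neighbor interpolation).
    i = bisect.bisect_left(rawEstimateData, E)
    lo, hi = max(0, i - k), min(len(rawEstimateData), i + k)
    nearest = sorted(range(lo, hi), key=lambda j: abs(rawEstimateData[j] - E))[:k]
    return sum(biasData[j] for j in nearest) / len(nearest)

The bias-corrected raw estimate is then E − estimate_bias(E) for E ≤ 5m, and the unmodified raw estimate E otherwise.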
Deciding Which Algorithm To Use. The procedure described so far gives rise to a new estimator for the cardinality, namely the bias-corrected raw estimate. This procedure corrects for the bias using the empirically determined data for cardinalities smaller than 5m and uses the unmodified raw estimate otherwise. To evaluate how well the bias correction works, and to decide if this algorithm should be used in favor of LinearCounting, we perform another experiment, using the bias-corrected raw estimate, the raw estimate, as well as LinearCounting. We ran the three algorithms for different cardinalities and compare the distribution of the error. Note that we use a different dataset for this second experiment to avoid overfitting.

As shown in Figure 3, for cardinalities up to about 61000 the bias-corrected raw estimate has a smaller error than the raw estimate. For larger cardinalities, the error of the two estimators converges to the same level (since the bias gets smaller in this range), until the two error distributions coincide for cardinalities above 5m.

Figure 3: The median error of the raw estimate, the bias-corrected raw estimate, as well as LinearCounting for p = 14. Also shown are the 5% and 95% quantiles of the error. The measurements are based on 5000 data points per cardinality.

For small cardinalities, LinearCounting is still better than the bias-corrected raw estimate². Therefore, we determine the intersection of the error curves of the bias-corrected raw estimate and LinearCounting to be at 11500 for precision 14 and use LinearCounting to the left, and the bias-corrected raw estimate to the right, of that threshold.

² This is not entirely true; for very small cardinalities it seems that the bias-corrected raw estimate has again a smaller error, but a higher variability. Since LinearCounting also has low error, and depends less on empirical data, we decided to use it for all cardinalities below the threshold.

As with the bias correction, the algorithm does not have access to the true cardinality to decide on which side of the threshold the cardinality lies, and thus which algorithm should be used. However, again we can use one of the estimates to make the decision. Since the threshold is in a range where LinearCounting has a fairly small error, we use its estimate and compare it with the threshold. We call the resulting algorithm that combines LinearCounting and the bias-corrected raw estimate HllNoBias.

Advantages of Bias Correction. Using the bias-corrected raw estimate in combination with LinearCounting has a series of advantages compared to combining the raw estimate and LinearCounting:

• The error for an important range of cardinalities is smaller than the error of Hll64Bit. For precision 14, this range is roughly between 18000 and 61000 (cf. Figure 3).

• The resulting algorithm does not have a significant bias. This is not true for Hll64Bit (or HllOrig), which uses the raw estimate for cardinalities above the threshold of (5/2)m. However, at that point, the raw estimate is still significantly biased, as illustrated in Figure 4.

• Both algorithms use an empirically determined threshold to decide which of the two sub-algorithms to use. However, the two relevant error curves for HllNoBias are less steep at the threshold compared to Hll64Bit (cf. Figure 3). This has the advantage that a small error in the threshold has smaller consequences for the accuracy of the resulting algorithm.

5.3 Sparse Representation
HllNoBias requires a constant amount of memory of 6m bits throughout the execution, regardless of n, violating our memory efficiency requirement. If n ≪ m, then most of the registers are never used and thus do not have to be represented in memory. Instead, we can use a sparse representation that stores pairs (idx, ϱ(w)). If the list of such pairs would require more memory than the dense representation of the registers (i.e., 6m bits), the list can be converted to the dense representation.

Note that pairs with the same index can be merged by keeping the one with the highest ϱ(w) value. Various strategies can be used to make insertions of new pairs into the sparse representation, as well as merging elements with the same index, efficient. In our implementation we represent an (idx, ϱ(w)) pair as a single integer by concatenating the bit patterns for idx and ϱ(w) (storing idx in the higher-order bits of the integer).

Our implementation then maintains a sorted list of such integers. Furthermore, to enable quick insertion, a separate set is kept where new elements can be added quickly without keeping them sorted. Periodically, this temporary set is sorted and merged with the list (e.g., if it reaches 25% of the maximum size of the sparse representation), removing any pairs where another pair with the same index and a higher ϱ(w) value exists.

Because the index is stored in the high-order bits of the integer, the sorting ensures that pairs with the same index occur consecutively in the sorted sequence, allowing the merge to happen in a single linear pass over the sorted set and the list. In the pseudo-code of Figure 6, this merging happens in the subroutine Merge.
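The following Python sketch (ours) illustrates the representation: each pair is packed into a single integer with idx in the high-order bits, new pairs are buffered in a temporary set, and a periodic merge keeps only the highest ϱ value per index:

RHO_BITS = 6  # the register value fits in 6 bits for a 64 bit hash

def pack(idx, rho):
    return (idx << RHO_BITS) | rho

def merge(sorted_list, tmp_set):
    # After sorting, pairs with equal idx are adjacent because idx occupies
    # the high-order bits; keep only the entry with the largest rho.
    merged = []
    for k in sorted(list(tmp_set) + sorted_list):
        if merged and (merged[-1] >> RHO_BITS) == (k >> RHO_BITS):
            merged[-1] = max(merged[-1], k)   # same idx: larger rho wins
        else:
            merged.append(k)
    return merged

In a full implementation one would buffer pack(idx, rho) values in the set, merge once the buffer reaches about 25% of the sparse budget, and convert to the dense register array as soon as the merged list would exceed 6m bits.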
Figure 4: The median bias of HllOrig and HllNoBias. The measurements are again based on 5000 data points per cardinality.

The computation of the overall result given a sparse representation in phase 2 of the algorithm can be done in a straightforward manner by iterating over all entries in the list (after the temporary set has been merged with the list), and assuming that any index not present in the list has a register value of 0. As we will explain in the next section, this is not necessary in our final algorithm and thus is not part of the pseudo-code in Figure 6.

The sparse representation reduces the memory consumption for cases where the cardinality n is small, and only adds a small runtime overhead by amortizing the cost of searching and merging through the use of the temporary set.

5.3.1 Higher Precision for the Sparse Representation
Every item in the sparse representation requires p + 6 bits, namely to store the index (p bits) and the value of that register (6 bits). In the sparse representation we can choose to perform all operations with a different precision argument p′ > p. This allows us to increase the accuracy in cases where only the sparse representation is used (and it is not necessary to convert to the normal representation). If the sparse representation gets too large and reaches the user-specified memory threshold of 6m bits, it is possible to fall back to precision p and switch to the dense representation. Note that falling back from p′ to the lower precision p is always possible: Given a pair (idx′, ϱ(w′)) that has been determined with precision p′, one can determine the corresponding pair (idx, ϱ(w)) for the smaller precision p as follows. Let h(v) be the hash value for the underlying data element v.

1. idx consists of the p most significant bits of h(v), and since p < p′, we can determine idx by taking the p most significant bits of idx′.

2. For ϱ(w) we need the number of leading zeros of the bits of h(v) after the index bits, i.e., of bits 63 − p to 0. The bits 63 − p to 64 − p′ are known by looking at idx′. If at least one of these bits is one, then ϱ(w) can be computed using only those bits. Otherwise, bits 63 − p to 64 − p′ are all zero, and using ϱ(w′) we know the number of leading zeros of the remaining bits. Therefore, in this case we have ϱ(w) = ϱ(w′) + (p′ − p).

This computation is done in DecodeHash of Figure 7. It is possible to compute at a different, potentially much higher accuracy p′ in the sparse representation, without exceeding the memory limit indicated by the user through the precision parameter p. Note that choosing a suitable value for p′ is a trade-off: The higher p′ is, the smaller the error for cases where only the sparse representation is used. However, at the same time as p′ gets larger, every pair requires more memory, which means the user-specified memory threshold is reached sooner in the sparse representation and the algorithm needs to switch to the dense representation earlier. Also note that one can increase p′ up to 64, at which point the full hash code is kept.

We use the name HllSparse1 to refer to this algorithm. To illustrate the increased accuracy, Figure 5 shows the error distribution with and without the sparse representation.

5.3.2 Compressing the Sparse Representation
So far, we presented the sparse representation as using a temporary set and a list which is kept sorted. Since the temporary set is used for quickly adding new elements and is merged with the list before it gets large, using a simple implementation with some built-in integer type works well, even if some bits per entry are wasted (due to the fact that built-in integer types may be too wide). For the list, however, we can exploit two facts to store the elements more compactly. First of all, there is an upper limit on the number of bits used per integer, namely p′ + 6. Using an integer of fixed width (e.g., int or long as offered in many programming languages) might be wasteful. Furthermore, the list is guaranteed to be sorted, which can be exploited as well.

We use a variable length encoding for integers that uses a variable number of bytes to represent integers, depending on their absolute value. Furthermore, we use a difference encoding, where we store the difference between successive elements in the list. That is, for a sorted list a_1, a_2, a_3, . . . we would store a_1, a_2 − a_1, a_3 − a_2, . . .. The values in such a list of differences have smaller absolute values, which makes the variable length encoding even more efficient. Note that when sequentially going through the list, the original items can easily be recovered.

We use the name HllSparse2 if only the variable length encoding is used, and HllSparse3 if additionally the difference encoding is used.
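A minimal Python sketch (ours) of these two compression steps, using a byte-oriented variable length encoding on the differences of the sorted list:

def encode_list(sorted_vals):
    out, prev = bytearray(), 0
    for v in sorted_vals:
        d = v - prev                # difference encoding keeps values small
        prev = v
        while d >= 0x80:            # variable length: 7 payload bits per byte
            out.append((d & 0x7F) | 0x80)
            d >>= 7
        out.append(d)
    return bytes(out)

def decode_list(data):
    vals, prev, acc, shift = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += acc             # undo the difference encoding
            vals.append(prev)
            acc, shift = 0, 0
    return vals

assert decode_list(encode_list([3, 17, 1000, 1001])) == [3, 17, 1000, 1001]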
Figure 5: Comparison of HllNoBias (without sparse representation) and Hll++ (with sparse representation) to illustrate the increased accuracy, here for p′ = 25, with 5% and 95% quantiles. The measurements are on 5000 data points per cardinality.

Note that introducing these two compression steps is possible in an efficient way, as the sorted list of encoded hash values is only updated in batches (when merging the entries in the set with the list). Adding a single new value to the compressed list would be expensive, as one has to potentially read the whole list in order to even find the correct insertion point (due to the difference encoding). However, when the list is merged with the temporary set, the list is traversed sequentially in any case. The pseudo-code in Figure 7 uses the not further specified subroutine DecodeSparse to decompress the variable length and difference encoding in a straightforward way.

5.3.3 Encoding Hash Values
It is possible to further improve the storage efficiency with the following observations. If the sparse representation is used for the complete aggregation phase, then in the result computation HllNoBias will always use LinearCounting to determine the result. This is because the maximum number of hash values that can be stored in the sparse representation is small compared to the cardinality threshold of where to switch from LinearCounting to the bias-corrected raw estimate. Since LinearCounting only requires the number of distinct indices (and m′), there is no need for ϱ(w′) from the pair. The value ϱ(w′) is only used when switching from the sparse to the normal representation, and even then only if the bits x_{63−p}, . . . , x_{64−p′} are all 0. For a good hash function with uniform hash values, the value ϱ(w′) only needs to be stored with probability 2^{p−p′}.

This idea can be realized by only storing ϱ(w′) if necessary, using one bit (e.g., the least significant bit) to indicate whether it is present or not. We use the following encoding: If the bits x_{63−p}, . . . , x_{64−p′} are all 0, then the resulting integer is

⟨x_63, . . . , x_{64−p′}⟩ || ⟨ϱ(w′)⟩ || 1

(where we use || to concatenate bits). Otherwise, the pair is encoded as

⟨x_63, . . . , x_{64−p′}⟩ || 0

The least significant bit allows to easily decode the integer again. Procedures EncodeHash and DecodeHash of Figure 7 implement this encoding.

In our implementation³ we fix p′ = 25, as this provides very high accuracy for the range of cardinalities where the sparse representation is used. Furthermore, the 25 bits for the index, 6 bits for ϱ(w′) and one indicator bit fit nicely into a 32 bit integer, which is useful from a practical standpoint.

³ This holds for all backends except for our own column store, where we use both p′ = 20 and p′ = 25, also see Section 6.

We call this algorithm HyperLogLog++ (or Hll++ for short), which is shown in Figure 6 and includes all improvements from this paper.

5.3.4 Space Efficiency
In Table 1 we show the effects of the different encoding strategies on the space efficiency of the sparse encoding for a selection of precision parameters p. The less memory a single pair requires on average, the longer the algorithm can use the sparse representation without switching to the dense representation. This directly translates to a high precision for a larger range of cardinalities.

Table 1: The maximum number of pairs with distinct index that can be stored before the representation reaches the size of the dense representation, i.e., 6m bits. All measurements have been repeated for different inputs, for p′ = 25.

p    m        HllSparse1   HllSparse2   HllSparse3   Hll++
10   1024        192.00       316.45       420.73      534.27
12   4096        768.00      1261.82      1962.18     2407.73
14   16384      3072.00      5043.91      8366.45    12107.00
16   65536     12288.00     20174.64     35616.73    51452.64

For instance, for precision 14, storing every element in the sparse representation as an integer would require 32 bits. The variable length encoding reduces this to an average of 19.49 bits per element. Additionally introducing a difference encoding requires 11.75 bits per element, and using the improved encoding of hash values further decreases this value to 8.12 bits on average.

5.4 Evaluation of All Improvements
To evaluate the effect of all improvements, we ran HllOrig as presented in [7] and Hll++. The error distribution clearly illustrates the positive effects of our changes on the accuracy of the estimate. Again, a fresh dataset has been used for this experiment, and the results of the comparison for precision 14 are shown in Figure 8.
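The encoding, together with the precision fallback of Section 5.3.1, can be sketched in Python as follows (our illustration for a 64 bit hash value x; P and PP play the roles of p and p′):

P, PP = 14, 25

def rho(w, bits):
    return bits - w.bit_length() + 1

def encode_hash(x):
    # Sketch of EncodeHash for a 64 bit hash value x.
    idx_pp = x >> (64 - PP)                    # <x_63, ..., x_{64-p'}>
    between = idx_pp & ((1 << (PP - P)) - 1)   # bits x_{63-p}, ..., x_{64-p'}
    if between == 0:
        r = rho(x & ((1 << (64 - PP)) - 1), 64 - PP)
        return (idx_pp << 7) | (r << 1) | 1    # rho(w') must be stored
    return idx_pp << 1                         # rho recoverable from the index

def decode_hash(k):
    # Sketch of DecodeHash: recover (idx, rho(w)) at the lower precision p.
    if k & 1:
        idx_pp = k >> 7
        r = ((k >> 1) & 0x3F) + (PP - P)       # rho(w) = rho(w') + (p' - p)
    else:
        idx_pp = k >> 1
        r = rho(idx_pp & ((1 << (PP - P)) - 1), PP - P)
    return idx_pp >> (PP - P), r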
Input: The input data set S, the precision p, the precision p′ used in the sparse representation, where p ∈ [4..p′] and p′ ≤ 64. Let h : D → {0, 1}^64 hash data from domain D to 64 bit values.
Phase 0: Initialization.
 1: m := 2^p; m′ := 2^{p′}
 2: α_16 := 0.673; α_32 := 0.697; α_64 := 0.709
 3: α_m := 0.7213/(1 + 1.079/m) for m ≥ 128
 4: format := sparse
 5: tmp_set := ∅
 6: sparse_list := []
Phase 1: Aggregation.
 7: for all v ∈ S do
 8:   x := h(v)
 9:   switch format do
10:     case normal
11:       idx := ⟨x_63, . . . , x_{64−p}⟩_2
12:       w := ⟨x_{63−p}, . . . , x_0⟩_2
13:       M[idx] := max{M[idx], ϱ(w)}
14:     end case
15:     case sparse
16:       k := EncodeHash(x, p, p′)
17:       tmp_set := tmp_set ∪ {k}
18:       if tmp_set is too large then
19:         sparse_list := Merge(sparse_list, Sort(tmp_set))
20:         tmp_set := ∅
21:         if |sparse_list| > 6m bits then
22:           format := normal
23:           M := ToNormal(sparse_list)
24:         end if
25:       end if
26:     end case
27:   end switch
28: end for
Phase 2: Result computation.
29: switch format do
30:   case sparse
31:     sparse_list := Merge(sparse_list, Sort(tmp_set))
32:     return LinearCounting(m′, m′ − |sparse_list|)
33:   end case
34:   case normal
35:     E := α_m · m^2 · ( Σ_{j=0}^{m−1} 2^{−M[j]} )^{−1}
36:     E′ := (E ≤ 5m) ? (E − EstimateBias(E, p)) : E
37:     Let V be the number of registers equal to 0.
38:     if V ≠ 0 then
39:       H := LinearCounting(m, V)
40:     else
41:       H := E′
42:     end if
43:     if H ≤ Threshold(p) then
44:       return H
45:     else
46:       return E′
47:     end if
48:   end case
49: end switch

Figure 6: The Hll++ algorithm that includes all the improvements presented in this paper. Some auxiliary procedures are given in Figure 7.

Define LinearCounting(m, V)
  Returns the LinearCounting cardinality estimate.
 1: return m log(m/V)

Define Threshold(p)
  Returns the empirically determined threshold (we provide the values from our implementation at http://goo.gl/iU8Ig).

Define EstimateBias(E, p)
  Returns the estimated bias, based on interpolating with the empirically determined values.

Define EncodeHash(x, p, p′)
  Encodes the hash code x as an integer.
 2: if ⟨x_{63−p}, . . . , x_{64−p′}⟩ = 0 then
 3:   return ⟨x_63, . . . , x_{64−p′}⟩ || ⟨ϱ(⟨x_{63−p′}, . . . , x_0⟩)⟩ || 1
 4: else
 5:   return ⟨x_63, . . . , x_{64−p′}⟩ || 0
 6: end if

Define GetIndex(k, p)
  Returns the index with precision p stored in k.
 7: if ⟨k_0⟩ = 1 then
 8:   return ⟨k_{p′+6}, . . . , k_{p′−p+7}⟩
 9: else
10:   return ⟨k_{p′}, . . . , k_{p′−p+1}⟩
11: end if

Define DecodeHash(k, p, p′)
  Returns the index and ϱ(w) with precision p stored in k.
12: if ⟨k_0⟩ = 1 then
13:   r := ⟨k_6, . . . , k_1⟩ + (p′ − p)
14: else
15:   r := ϱ(⟨k_{p′−p}, . . . , k_1⟩)
16: end if
17: return (GetIndex(k, p), r)

Define ToNormal(sparse_list, p, p′)
  Converts the sparse representation to the normal one.
18: M := NewArray(m)
19: for all k ∈ DecodeSparse(sparse_list) do
20:   (idx, r) := DecodeHash(k, p, p′)
21:   M[idx] := max{M[idx], r}
22: end for
23: return M

Define Merge(a, b)
  Expects two sorted lists a and b, where the first is compressed using a variable length and difference encoding. Returns a list that is sorted and compressed in the same way as a, and contains all elements from a and b, except for entries where another element with the same index, but higher ϱ(w) value, exists. This can be implemented in a single linear pass over both lists.

Define DecodeSparse(a)
  Expects a sorted list that is compressed using a variable length and difference encoding, and returns the elements from that list after it has been decompressed.

Figure 7: Auxiliary procedures for the Hll++ algorithm.
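As a small companion to Figure 7, here is a Python sketch (ours) of ToNormal, which replays the decoded sparse entries into a dense register array at the lower precision p (decode_hash as sketched after Section 5.3.3):

def to_normal(sparse_entries, m, decode_hash):
    # Convert the sparse representation to the m dense registers.
    M = [0] * m
    for k in sparse_entries:
        idx, r = decode_hash(k)
        M[idx] = max(M[idx], r)
    return M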
Figure 8: Comparison of HllOrig with Hll++. The mean relative error as well as the 5% and 95% quantiles are shown. The measurements are on 5000 data points per cardinality.

First of all, there is a spike in the error of HllOrig, almost exactly at n = (5/2)m = 40960. The reason for this is that the ideal threshold of when to switch from LinearCounting to the raw estimate in HllOrig is not precisely at (5/2)m. As explained in Section 5.2, a relatively small error in this threshold leads to a rather large error in the overall error, because the error curve of the raw estimate is fairly steep. Furthermore, even if the threshold was determined more precisely for HllOrig, its error would still be larger than that of Hll++ in this range of cardinalities. Our HyperLogLog++ algorithm does not exhibit any such behavior.

The advantage of the sparse representation is clearly visible; for cardinalities smaller than about 12000, the error of our final algorithm Hll++ is significantly smaller than for HllOrig without a sparse representation.

6. IMPLICATIONS FOR DICTIONARY ENCODINGS OF COLUMN STORES
In this section we focus on a property of HyperLogLog (and Hll++ in particular) that we were able to exploit in our implementation for the column store presented in [9].

Given the expression materialization strategy of that column store, any cardinality estimation algorithm that computes the hash value of an expression will have to add a data column of these values to the store. The advantage of this approach is that the hash values for any given expression only need to be computed once. Any subsequent computation can use the precomputed data column.

As explained in [9], most column stores use a dictionary encoding for the data columns that maps values to identifiers. If there is a large number of distinct elements, the size of the dictionary can easily dominate the memory needed for a given count distinct query.

HyperLogLog has the useful property that not the full hash code is required for the computation. Instead, it suffices to know the first p bits (or p′ if the sparse representation with higher accuracy from Section 5.3 is used) as well as the number of leading zeros of the remaining bits.

This leads to a smaller maximum number of distinct values for the data column. While there are 2^64 possible distinct hash values if the full hash values were to be stored, there are only 2^p · (64 − p′ − 1) + 2^p · (2^{p′−p} − 1) different values that our integer encoding from Section 5.3.3 can take⁴. This reduces the maximum possible size of the dictionary by a major factor if p′ ≪ 64.

⁴ This can be seen as follows: There are 2^p · (64 − p′ − 1) many encoded values that store ϱ(w′), and similarly 2^p · (2^{p′−p} − 1) many without it.

For example, our column store already bounds the size of the dictionary to 10 million (and thus there will never be 2^64 different hash values). Nonetheless, using the default parameters p′ = 20 and p = 14, there can still be at most 1.74 million values, which directly translates to a memory saving of more than 82% for the dictionary if all input values are distinct.
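The quoted numbers are easy to verify (our calculation, in Python):

p, pp = 14, 20
with_rho    = 2**p * (64 - pp - 1)       # values storing rho(w'):   704,512
without_rho = 2**p * (2**(pp - p) - 1)   # values without rho(w'): 1,032,192
total = with_rho + without_rho           # 1,736,704, i.e. about 1.74 million
saving = 1 - total / 10_000_000          # about 0.826, i.e. more than 82%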
7. CONCLUSIONS
In this paper we presented a series of improvements to the HyperLogLog algorithm. Most of these changes are orthogonal to each other and can thus be applied independently to fit the needs of a particular application.

The resulting algorithm HyperLogLog++ fulfills the requirements listed in Section 3. Compared to the practical variant of HyperLogLog from [7], the accuracy is significantly better for a large range of cardinalities and equally good on the rest. For precision 14 (and p′ = 25), the sparse representation allows the average error for cardinalities up to roughly 12000 to be smaller by a factor of 4. For cardinalities between 12000 and 61000, the bias correction allows for a lower error and avoids a spike in the error when switching between sub-algorithms due to less steep error curves. The sparse representation also allows for a more adaptive use of memory; if the cardinality n is much smaller than m, then HyperLogLog++ requires significantly less memory. This is of particular importance in PowerDrill where often many count distinct computations are carried out in parallel for a single count distinct query. Finally, the use of 64 bit hash codes allows the algorithm to estimate cardinalities well beyond 1 billion.

All of these changes can be implemented in a straightforward way and we have done so for the PowerDrill system. We provide a complete list of our empirically determined parameters at http://goo.gl/iU8Ig to allow easier reproduction of our results.
8. REFERENCES
[1] K. Aouiche and D. Lemire. A comparison of five
[1] K. Aouiche and D. Lemire. A comparison of five
probabilistic view-size estimation techniques in OLAP.
In Workshop on Data Warehousing and OLAP
(DOLAP), pages 17–24, 2007.
[2] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar,
and L. Trevisan. Counting distinct elements in a data
stream. In Workshop on Randomization and
Approximation Techniques (RANDOM), pages 1–10,
London, UK, 2002. Springer-Verlag.
[3] P. Clifford and I. A. Cosma. A statistical analysis of
probabilistic counting algorithms. Scandinavian
Journal of Statistics, pages 1–14, 2011.
[4] M. Durand and P. Flajolet. Loglog counting of large
cardinalities. In G. D. Battista and U. Zwick, editors,
European Symposium on Algorithms (ESA), volume
2832, pages 605–617, 2003.
[5] C. Estan, G. Varghese, and M. Fisk. Bitmap
algorithms for counting active flows on high-speed
links. IEEE/ACM Transactions on Networking, pages
925–937, 2006.
[6] P. Flajolet and G. N. Martin. Probabilistic counting
algorithms for data base applications. Journal of
Computer and System Sciences, 31(2):182–209, 1985.
[7] P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier.
Hyperloglog: The analysis of a near-optimal
cardinality estimation algorithm. In Analysis of
Algorithms (AOFA), pages 127–146, 2007.
[8] F. Giroire. Order statistics and estimating
cardinalities of massive data sets. Discrete Applied
Mathematics, 157(2):406–427, 2009.
[9] A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and
M. Nunkesser. Processing a trillion cells per mouse
click. In Very Large Databases (VLDB), 2012.
[10] P. Indyk. Tight lower bounds for the distinct elements
problem. In Foundations of Computer Science
(FOCS), pages 283–288, 2003.
[11] D. M. Kane, J. Nelson, and D. P. Woodruff. An
optimal algorithm for the distinct elements problem.
In Principles of database systems (PODS), pages
41–52. ACM, 2010.
[12] J. Lumbroso. An optimal cardinality estimation
algorithm based on order statistics and its full
analysis. In Analysis of Algorithms (AOFA), pages
489–504, 2010.
[13] S. Melnik, A. Gubarev, J. J. Long, G. Romer,
S. Shivakumar, M. Tolton, and T. Vassilakis.
Dremel: Interactive analysis of web-scale datasets. In
Very Large Databases (VLDB), pages 330–339, 2010.
[14] A. Metwally, D. Agrawal, and A. E. Abbadi. Why go
logarithmic if we can go linear? Towards effective
distinct counting of search traffic. In Extending
database technology (EDBT), pages 618–629, 2008.
[15] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan.
Interpreting the data: Parallel analysis with Sawzall.
Journal on Scientific Programming, pages 277–298,
2005.
[16] K.-Y. Whang, B. T. Vander-Zanden, and H. M.
Taylor. A linear-time probabilistic counting algorithm
for database applications. ACM Transactions on
Database Systems, 15:208–229, 1990.