This document discusses cardinality estimation techniques for very large data sets. It introduces HyperLogLog (HLL), an algorithm for distinct count estimation that uses stochastic averaging and hash-based binning to estimate the cardinality of data sets containing up to billions of elements using only 1.5KB of memory. The document explains how HLL works, including how values are added and cardinality is estimated from the HLL data structure. It also discusses extensions like HLL++ and related probabilistic data structures.
2. THANKS FOR COMING!
I build large scale distributed systems and work on algorithms that make sense of the data stored in them.
Contributor to the open source project Stream-Lib, a Java library for summarizing data streams (https://github.com/clearspring/stream-lib)
Ask me questions: @abramsm
3. HOW CAN WE COUNT THE NUMBER OF DISTINCT ELEMENTS IN LARGE DATA SETS?
4. HOW CAN WE COUNT THE NUMBER OF DISTINCT ELEMENTS IN VERY LARGE DATA SETS?
5. GOALS FOR COUNTING SOLUTION
• Support high throughput data streams (up to many hundreds of thousands of events per second)
• Estimate cardinality with known error thresholds in sets of up to around 1 billion elements (or even 1 trillion when needed)
• Support set operations (unions and intersections)
• Support data streams with a large number of dimensions
10. NAÏVE SOLUTIONS
• SELECT COUNT(DISTINCT uid) FROM table WHERE dimension = foo
• HashSet<K>
• Run a batch job for each new query request
11. WE ARE NOT A BANK
This means an estimate rather than an exact value is acceptable.
13. THREE INTUITIONS
• It is possible to estimate the cardinality of a set by understanding the probability of a sequence of events occurring in a random variable (e.g. how many coins were flipped if I saw n heads in a row?)
• Averaging the results of multiple observations can reduce the variance associated with random variables
• Applying a good hash function effectively de-duplicates the input stream
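The coin-flip intuition fits in a few lines of Java. The sketch below is a toy illustration (not part of the talk's stream-lib code) that uses the well-known MurmurHash3 finalizer as a stand-in for a real hash function; it shows the longest run of leading zero bits tracking log2(n):

    public class CoinFlipIntuition {
        // MurmurHash3 finalizer: scrambles an int so its bits look uniformly random.
        static int mix(int h) {
            h ^= h >>> 16; h *= 0x85ebca6b;
            h ^= h >>> 13; h *= 0xc2b2ae35;
            h ^= h >>> 16;
            return h;
        }

        public static void main(String[] args) {
            for (int n : new int[] {1_000, 100_000, 10_000_000}) {
                int maxRun = 0;
                for (int i = 0; i < n; i++) {
                    // Longest run of leading zeros grows like log2(n).
                    maxRun = Math.max(maxRun, Integer.numberOfLeadingZeros(mix(i)));
                }
                // A single observable gives the right order of magnitude, but with huge variance.
                System.out.println(n + " distinct -> estimate ~ 2^" + maxRun + " = " + (1L << maxRun));
            }
        }
    }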
14. INTUITION
What is the probability that a binary string starts with '01'? (Each of the first two bits is 0 or 1 with probability 1/2, so the answer is 1/2 × 1/2 = 1/4.)
17. INTUITION
Crude analysis: if a stream has 8 unique values, the hash of at least one of them should start with '001' (any specific 3-bit prefix appears with probability 1/8).
18. INTUITION
Given the variability of a single random value, we cannot use a single variable for accurate cardinality estimations.
19. MULTIPLE OBSERVATIONS HELP REDUCE VARIANCE
By averaging the observations of multiple independent random variables we can make the error rate as small as desired by controlling m (the number of random variables):
error = σ / √m
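As a worked example: if a single observable has standard deviation σ, averaging m = 1024 independent observables cuts the error to σ/√1024 = σ/32, and quadrupling m halves the error again.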
20. THE PROBLEM WITH MULTIPLE HASH FUNCTIONS
• It is too costly from a computational perspective to apply m hash functions to each data point
• It is not clear that it is possible to generate m good hash functions that are independent
21. STOCHASTIC AVERAGING
• Emulate the effect of m experiments with a single hash function
• Divide the input stream h(M) into m sub-streams according to which interval [0, 1/m), [1/m, 2/m), …, [(m−1)/m, 1) the hashed value falls into
• An average of the observable values for each sub-stream will yield a cardinality estimate whose accuracy improves in proportion to 1/√m as m increases
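In bit terms, landing in the j-th interval is equivalent to the top log2(m) bits of the hash selecting sub-stream j, with the remaining bits feeding that sub-stream's observable. A minimal sketch of the split (hypothetical helper names, assuming a 64-bit hash):

    // Stochastic averaging: one hash emulates m = 2^b experiments.
    static final int B = 11;           // m = 2^11 = 2048 sub-streams
    static final int M = 1 << B;

    // Top b bits select the sub-stream (same as the interval test above).
    static int bucketOf(long hash) {
        return (int) (hash >>> (64 - B));
    }

    // Remaining bits feed the observable for that sub-stream.
    static long observableBits(long hash) {
        return hash << B;
    }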
22. HASH FUNCTIONS
Number of hashed items needed before reaching the given odds of a collision:
Odds of a collision | 32-bit hash | 64-bit hash  | 160-bit hash
1 in 2              | 77,163      | 5.06 billion | 1.42 × 10^24
1 in 10             | 30,084      | 1.97 billion | 5.55 × 10^23
1 in 100            | 9,292       | 609 million  | 1.71 × 10^23
1 in 1000           | 2,932       | 192 million  | 5.41 × 10^22
http://preshing.com/20110504/hash-collision-probabilities
23. HYPERLOGLOG (2007)
Counts up to 1 billion distinct elements in 1.5KB of space
Philippe Flajolet (1948-2011)
24. HYPERLOGLOG (HLL)
• Operates with a single pass over the input data set
• Produces a typical error of 1.04 / √m
• Error decreases as m increases; error is not a function of the number of elements in the set
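Concretely: with m = 2048 registers the typical error is 1.04/√2048 ≈ 2.3%, and with m = 16384 it drops to 1.04/128 ≈ 0.8%, whether the set holds a thousand elements or a billion.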
25. HLL SUBSTREAMS
HLL uses a single hash function and splits the result into m buckets.
(Diagram: input values from stream S → hash function → bucket 1, bucket 2, …, bucket m)
26. HLL ALGORITHM BASICS
• Each substream maintains an observable
• The observable is the largest value ρ(x), where ρ(x) is the position of the leftmost 1-bit in the binary string x
• 32-bit hash function with 5-bit "short bytes"
• The harmonic mean increases the quality of estimates by reducing variance
27. WHAT ARE "SHORT BYTES"?
• We know a priori that the value of a given substream of the multiset M is in the range 0 .. (L + 1 − log2(m))
• Assuming L = 32, we only need 5 bits to store the value of the register
• 85% less memory usage compared to a standard Java int (32 bits)
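Worked through: with L = 32 and m = 2^11 registers, values lie in 0 .. (32 + 1 − 11) = 0 .. 22, which fits in 5 bits (2^5 = 32 > 22). Storing 5 bits instead of a 32-bit int saves 27/32 ≈ 84% of the memory, the "85%" quoted above.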
28. ADDING VALUES TO HLL
index = 1 + ⟨x1 x2 ⋯ xb⟩2        observable = ρ(x(b+1) x(b+2) ⋯)
• The first b bits of the new value define the index of the register in the multiset M that may be updated when the new value is added
• The bits from position b+1 onward are used to determine the leading number of zeros (ρ)
29. ADDING VALUES TO HLL
Observations: {M[1], M[2], …, M[m]}
The multiset is updated using the equation:
M[j] := max(M[j], ρ(w))
where ρ(w) is the number of leading zeros + 1
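Slides 28-29 combine into a few bit operations per value. Below is a minimal sketch of the add path, assuming a 64-bit hash; TinyHll is a hypothetical toy class for illustration, not the talk's stream-lib implementation:

    // Toy HLL register array: b index bits, m = 2^b registers.
    // Assumes `hash` comes from a good 64-bit hash function (e.g. MurmurHash3).
    public class TinyHll {
        final int b;
        final int m;
        final byte[] registers;

        TinyHll(int b) {
            this.b = b;
            this.m = 1 << b;
            this.registers = new byte[m];
        }

        void add(long hash) {
            int index = (int) (hash >>> (64 - b));         // first b bits -> register index
            long rest = hash << b;                         // bits b+1 onward
            int rho = Long.numberOfLeadingZeros(rest) + 1; // leftmost 1-bit position
            if (rho > registers[index]) {
                registers[index] = (byte) rho;             // M[j] := max(M[j], rho(w))
            }
        }
    }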
30. INTUITION ON EXTRACTING CARDINALITY FROM HLL
• If we add n unique elements to a stream then each substream will contain roughly n/m elements
• The MAX value in each substream should be about log2(n/m) (from the earlier intuition about random variables)
• The harmonic mean (mZ) of the 2^MAX values is on the order of n/m
• So m²Z is on the order of n ← that's the cardinality!
31. HLL CARDINALITY ESTIMATE
E := αm · m² · ( Σ j=1..m 2^(−M[j]) )^(−1)
(the inverted sum is the harmonic-mean term: the harmonic mean of the 2^(M[j]) values is m · Z, where Z = (Σ 2^(−M[j]))^(−1))
• m²Z has a systematic multiplicative bias that needs to be corrected. This is done by multiplying by a constant value, αm
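The estimate is a direct transcription of the formula. Continuing the TinyHll sketch, with αm approximated by the constant the HLL paper gives for large m (a simplification; real implementations also apply the range corrections discussed next):

    // Continuing TinyHll: E := alpha_m * m^2 * (sum over j of 2^(-M[j]))^(-1)
    double cardinality() {
        double sum = 0.0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);               // 2^(-M[j]); an empty register adds 2^0 = 1
        }
        double alphaM = 0.7213 / (1.0 + 1.079 / m); // bias-correction constant for m >= 128
        return alphaM * m * m / sum;
    }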
32. A NOTE ON LONG RANGE CORRECTIONS
• The paper says to apply a long range correction function when the estimate is greater than: E > (1/30) · 2^32
• The correction function is: E* := −2^32 · log(1 − E / 2^32)
• DON'T DO THIS! It doesn't work and increases error. A better approach is to use a bigger/better hash function
33. DEMO TIME!
Let's look at HLL in action:
http://www.aggregateknowledge.com/science/blog/hll.html
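To experiment beyond the demo, the Stream-Lib library from slide 2 ships an HLL implementation. A minimal usage sketch, assuming the HyperLogLog class as published in the clearspring/stream-lib repository:

    import com.clearspring.analytics.stream.cardinality.HyperLogLog;

    public class HllDemo {
        public static void main(String[] args) {
            HyperLogLog hll = new HyperLogLog(11);      // log2m = 11 -> 2048 registers
            for (int i = 0; i < 1_000_000; i++) {
                hll.offer("user-" + (i % 250_000));     // 250k distinct values, each offered 4 times
            }
            // Expect roughly 250,000, within the ~2.3% typical error for m = 2048.
            System.out.println("estimate: " + hll.cardinality());
        }
    }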
34. HLL UNIONS
• Merging two or more HLL data structures is a similar process to adding a new value to a single HLL
• For each register in the HLL, take the max value of the HLLs you are merging; the resulting register set can be used to estimate the cardinality of the combined sets
(Diagram: per-day HLLs — MON, TUE, WED, THU, FRI — merging into a single Root HLL)
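Register-wise max is all a union takes. A sketch of a merge helper for the hypothetical TinyHll class above, mirroring the rule just described:

    // Register-wise max merges two sketches into one whose estimate is the
    // cardinality of the union of both input streams.
    static TinyHll union(TinyHll x, TinyHll y) {
        if (x.b != y.b) throw new IllegalArgumentException("register counts differ");
        TinyHll out = new TinyHll(x.b);
        for (int j = 0; j < x.m; j++) {
            out.registers[j] = (byte) Math.max(x.registers[j], y.registers[j]);
        }
        return out;
    }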
35. HLL INTERSECTION
C = |A ∩ B| = |A| + |B| − |A ∪ B|
(Venn diagram: C is the overlap of sets A and B)
You must understand the properties of your sets to know if you can trust the resulting intersection
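Combining the union helper with inclusion-exclusion gives an intersection estimate. Note that the errors of all three estimates compound, which is why the caveat above matters, especially when the true intersection is small relative to the sets:

    // |A ∩ B| = |A| + |B| - |A ∪ B|, estimated from three HLL cardinalities.
    static double intersection(TinyHll x, TinyHll y) {
        return x.cardinality() + y.cardinality() - union(x, y).cardinality();
    }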
36. HYPERLOGLOG++
• Google researchers have recently released an update to the HLL algorithm
• Uses clever encoding/decoding techniques to create a single data structure that is very accurate for small cardinality sets and can estimate sets that have over a trillion elements in them
• Empirical bias correction: observations show that most of the error in HLL comes from the bias function; using empirically derived values significantly reduces error
38. OTHER PROBABILISTIC DATA STRUCTURES
• Bloom Filters – set membership detection
• CountMinSketch – estimate the number of occurrences of a given element
• TopK Estimators – estimate the frequency and top elements of a stream
39. REFERENCES
• Stream-Lib - https://github.com/clearspring/stream-lib
• HyperLogLog - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.9475
• HyperLogLog In Practice - http://research.google.com/pubs/pub40671.html
• Aggregate Knowledge HLL Blog Posts - http://blog.aggregateknowledge.com/tag/hyperloglog/
Given that a good hash function produces output whose 0 and 1 bits are uniformly random, we can make observations about the probability of certain patterns appearing in the hashed value.