1. Scalable Simple Random Sampling Algorithms
Xiangrui Meng
Joint ICME/Statistics Seminar in Data Science
Xiangrui Meng (Databricks) ScaSRS March 3, 2014 1 / 40
2. Spark workshop (April 4, 2014)
http://icme.stanford.edu/news/2014/spark-workshop
Reza Zadeh (rezab@stanford.edu)
Apache Spark is a fast and general engine for large-scale data processing.
6. Statistical analysis of big data
Analyzing data sets of billions of records has now become a regular task in
many companies and institutions. The continuous increase in data size
keeps challenging the design of algorithms.
Design and implement new scalable algorithms.
Algorithms:
alternating direction method of multipliers (Boyd et al., 2011)
matrix factorization for recommender systems (Koren et al., 2009)
Libraries:
Vowpal Wabbit, Apache Mahout, H2O, and Spark MLlib
Reduce the data size and use traditional algorithms.
Sampling is a systematic and cost-effective way, sometimes with
provable performance:
Coresets for k-means and k-median clustering (Har-Peled et al., 2004).
Coresets for ℓ1, ℓ2, and ℓp regression (Clarkson, 2005; Drineas et al.,
2006; Dasgupta et al., 2009; ...)
However, even the sampling algorithms do not always scale well ...
7. Outline
1 Simple random sampling without replacement
Existing algorithms
Algorithm ScaSRS
Streaming environments
Stratified sampling
Empirical evaluation
2 Simple random sampling with replacement
Existing algorithms
Algorithm ScaSRSWR
Streaming environments
9. Simple random sampling (SRS)
Simple random sampling (Thompson, 2012)
Simple random sampling is a sampling design in which s distinct items are
selected from the n items in the population in such a way that every
possible combination of s items is equally likely to be the sample selected.
SRS is often used as
a sampling technique,
a building block for complex sampling methods.
Given an item set T, which contains n items: t1, . . . , tn, and an integer
s ≤ n, we want to generate a simple random sample of size s from T.
11. The draw-by-draw method
Draw-by-draw
1: Set S = ∅.
2: for i from 1 to s do
3: Select one item t with equal probability from T − S.
4: Let S := S + {t}.
5: end for
6: Return S.
Selecting one item with equal probability is hard due to
variable-length records,
no indices.
Representing T − S is also hard when data is large.
13. The selection-rejection algorithm
Selection-rejection (Fan, 1962)
1: Set i = 0.
2: for j from 1 to n do
3: With probability (s − i)/(n − j + 1), select tj and let i = i + 1.
4: end for
Pros:
One-pass.
O(1) storage.
Cons:
Sequential.
Needs both n and s to work.
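A minimal Python sketch of the selection-rejection pass above (function name mine):

```python
import random

def selection_rejection(items, s):
    """One sequential pass (Fan, 1962): select the j-th item with
    probability (remaining slots) / (remaining items)."""
    n = len(items)
    sample = []
    for j, t in enumerate(items):       # j = 0, ..., n - 1
        remaining = n - j               # equals n - j + 1 with 1-based j
        if random.random() < (s - len(sample)) / remaining:
            sample.append(t)
    return sample
```

Because the acceptance probability adapts to how many items are still needed, the output always has exactly s items.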
15. The reservoir algorithm
Reservoir (Vitter, 1985)
1: The first s items are stored into a reservoir R.
2: for i from s + 1 to n do
3: With probability s/i, replace an item from R chosen with equal
probability and let ti take its place.
4: end for
5: Select the items in R.
Pros:
Does not require n.
Cons:
Sequential.
O(s) storage.
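The reservoir pass can be sketched in Python (function name mine); note it reads the stream once and never needs n up front:

```python
import random

def reservoir_sample(stream, s):
    """Vitter (1985): keep the first s items; thereafter item i
    replaces a uniformly chosen reservoir slot with probability s/i."""
    reservoir = []
    for i, t in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(t)
        elif random.random() < s / i:
            reservoir[random.randrange(s)] = t
    return reservoir
```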
17. The random sort algorithm
Random sort (Sunter, 1977)
1: Associate each ti with an independent key ui drawn from U(0, 1).
2: Sort T in ascending order with regard to the key.
3: Select the smallest s items.
Cons:
A random permutation of the entire data set.
Pros:
The process of generating {ui } is embarrassingly parallel.
Sorting is scalable.
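Random sort fits in a few lines of Python (function name mine):

```python
import random

def random_sort_sample(items, s):
    """Sunter (1977): key each item with an independent U(0,1) draw,
    sort by key, keep the s smallest."""
    keyed = sorted((random.random(), t) for t in items)
    return [t for _, t in keyed[:s]]
```

The key generation parallelizes trivially; the cost is the full sort, which is exactly what ScaSRS avoids.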
20. An example of random sort
Set n = 100, s = 10, and hence the sampling probability p = s/n = 0.1.
1 Generate random keys:
(0.644, t1), (0.378, t2), . . . , (0.587, t10), . . . , (0.500, t99), (0.471, t100)
2 Sort and select the smallest 10 items:
(0.028, t94), (0.029, t44), . . . , (0.137, t69) ← the smallest 10 items
. . . , (0.980, t26), (0.988, t60)
Fact: the 10th item after the sort is associated with a random key 0.137.
23. Heuristics
Qualitatively speaking,
if ui is “much larger” than p, then ti is “very unlikely” to be selected;
if ui is “much smaller” than p, then ti is “very likely” to be selected.
Set two thresholds q1 and q2, such that
if ui > q1, reject ti directly;
if ui < q2, select ti directly;
otherwise, put ti onto a waiting list that goes to the sort phase.
The resulting algorithm fails
if we reject more than n − s items,
or if we select more than s items.
Otherwise, it returns the same result as the random sort algorithm.
24. ScaSRS: a scalable simple random sampling algorithm
1: Let l = 0, and W = ∅ be the waiting list.
2: for each item ti ∈ T do
3: Draw a key ui independently from U(0, 1).
4: if ui < q2 then
5: Select ti and let l := l + 1.
6: else if ui < q1 then
7: Associate ti with ui and add it into W .
8: end if
9: end for
10: Sort W ’s items in the ascending order of the key.
11: Select the smallest pn − l items from W .
ScaSRS outputs the same result as the random sort algorithm given
the same sequence of random keys, if it succeeds.
ScaSRS is embarrassingly parallel.
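As a single-process illustration, the scan and the waiting-list sort can be sketched in Python; the function name and explicit failure check are mine, and the thresholds follow the theorem's formulas with δ = 0.00005:

```python
import math
import random

def scasrs(items, p, delta=5e-5):
    """Sketch of ScaSRS: keys below q2 are selected immediately, keys
    above q1 rejected; only the small waiting list is ever sorted."""
    n = len(items)
    s = int(p * n)
    g1 = -math.log(delta) / n
    g2 = -2.0 * math.log(delta) / (3.0 * n)
    q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    selected, waiting = [], []
    for t in items:
        u = random.random()
        if u < q2:
            selected.append(t)          # select directly
        elif u < q1:
            waiting.append((u, t))      # defer to the sort phase
    if len(selected) > s or len(selected) + len(waiting) < s:
        raise RuntimeError("ScaSRS failed (probability <= 2*delta)")
    waiting.sort()                      # O(sqrt(s)) items w.h.p., not O(n)
    selected.extend(t for _, t in waiting[: s - len(selected)])
    return selected
```

The per-item decision uses only u, q1, and q2, so the scan can be split across any number of workers.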
25. A quantitative analysis
Theorem
For a fixed δ > 0, if we choose
q1 = min(1, p + γ1 + √(γ1² + 2γ1p)), where γ1 = −(log δ)/n,
q2 = max(0, p + γ2 − √(γ2² + 3γ2p)), where γ2 = −(2 log δ)/(3n),
ScaSRS succeeds with probability at least 1 − 2δ. Moreover, with high
probability, it only needs O(√s) storage and runs in O(n) time.
26. A practical choice of δ
Set δ = 0.00005. We get the following thresholds:
q1 = min(1, p + 10/n + √(100/n² + 20p/n)),
q2 = max(0, p + 20/(3n) − √(400/(9n²) + 20p/n)),
and ScaSRS succeeds with probability at least 1 − 2δ = 99.99%.
27. Sketch of proof
Denote by Ui the random key associated with item ti and let Yi = 1{Ui < q1}.
E[Yi] = q1 and E[Yi²] = q1.
Y = Σi Yi is the number of un-rejected items during the scan.
Apply a Bernstein-type inequality (Maurer, 2003):
log Pr{Y ≤ pn} ≤ −(q1 − p)²n / (2q1).
With the choice of q1 in our theorem, we have
Pr{Y ≤ s} ≤ δ.
By similar arguments, we can bound the number of selected items
during the scan:
Pr{Σi 1{Ui < q2} ≥ s} ≤ δ.
28. The size of the waiting list is O(√s), w.h.p.
[Figure: empirical pdfs of the number of un-rejected items and the number
of accepted items for n = 1e6, k = 50000, p = 0.05; both concentrate
within O(√k) of k.]
29. Streaming environments (when only p is given)
If only p is given, we can update the thresholds q1 and q2 on the fly based
on the number of items seen so far, denoted by i:
q1,i = min(1, p + γ1,i + √(γ1,i² + 2γ1,i p)), where γ1,i = −(log δ)/i,
q2,i = max(0, p + γ2,i − √(γ2,i² + 3γ2,i p)), where γ2,i = −(2 log δ)/(3i).
O(log n + √s log n) storage
at least 1 − 2δ success rate
not necessary to know the exact s but just a good lower bound
use the local count on each process
or a global count updated less frequently
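A streaming sketch of this scheme in Python (names mine): the thresholds are recomputed from the running count i, which makes early acceptances conservative (q2,i starts at 0) and early rejections impossible (q1,i starts at 1):

```python
import math
import random

def scasrs_streaming(stream, p, delta=5e-5):
    """Streaming ScaSRS sketch when only p is known: thresholds are
    updated from i, the number of items seen so far."""
    selected, waiting = [], []
    i = 0
    for t in stream:
        i += 1
        g1 = -math.log(delta) / i
        g2 = -2.0 * math.log(delta) / (3.0 * i)
        q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
        q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
        u = random.random()
        if u < q2:
            selected.append(t)
        elif u < q1:
            waiting.append((u, t))
    s = int(p * i)            # exact target size known only at stream end
    waiting.sort()
    selected.extend(t for _, t in waiting[: s - len(selected)])
    return selected
```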
30. Streaming environments (when only s is given)
When only s is given, we can no longer accept items on the fly because
the sampling probability could be arbitrarily small. However, we can still
reject items on the fly based on s and i:
q1,i = min(1, s/i + γ1,i + √(γ1,i² + 2γ1,i s/i)), where γ1,i = −(log δ)/i.
s(log n + 1) + O(√s + log n) storage
at least 1 − δ success rate
31. Stratified sampling
If the item set is heterogeneous, it may be possible to partition it into
several non-overlapping homogeneous subsets, called strata. Applying SRS
within each stratum is preferred to applying SRS to the entire set for
better representativeness. This approach is called stratified sampling.
Applications:
U.S. Census survey
political survey
Stratification:
based on training labels
based on days of the week
32. Stratified sampling (cont.)
Applying ScaSRS to stratified sampling is straightforward. Let m be the
number of strata. We have the following result:
If the size of each stratum is given, we need O(√(ms)) storage.
If only the sampling probability p is given, we need
O(m log n + √(ms) log n) storage.
If only the sample size s is given, we need
s(log n + 1) + O(√(sm) log n) storage.
34. Empirical evaluation: MapReduce implementation
1: Set l = 0.
2: function map(ti )
3: Generate ui from U(0, 1).
4: if ui < q2 then
5: Select (and output directly) ti .
6: l := l + 1.
7: else if ui < q1 then
8: Emit (ui , ti ).
9: end if
10: end function
11: function reduce(. . . , (ui , ti ), . . .)
12: Select the first pn − l items.
13: end function
35. Empirical evaluation: simple random sampling
                      P1     P2     P3     P4     P5     P6
n                   6.0e7  6.0e7  3.0e8  3.0e8  1.5e9  1.5e9
p                    0.01    0.1   0.01    0.1   0.01    0.1
s                   6.0e5  6.0e6  3.0e6  3.0e7  1.5e7  1.5e8
Selection-Rejection   281    355   1371   1475  >3600  >3600
Reservoir             288    299   1285   1571  >3600  >3600
Random Sort           513    581   1629   2344  >3600  >3600
ScaSRS                 96    103    126    127    140    158
ScaSRSp                98    114    109    139    162    214
|W|                 6.9e3  2.2e4  1.6e4  4.9e4  3.4e4  1.1e5
|Wp|                5.8e4  1.8e5  2.9e5  9.1e5  1.5e6  4.5e6
Table: Test problems, running times (in seconds), and waiting list sizes.
36. Empirical evaluation: stratified sampling
Problem and setup:
23.25 billion page-view events, 7 terabytes, 8 strata.
The ratio between the size of the largest stratum and that of the
smallest stratum is approximately 15000.
p = 0.01.
3000 mappers and 5 reducers.
Result:
509 seconds.
Within the waiting list, the ratio between the size of the largest
stratum and that of the smallest stratum is 861.2.
37. Summary for ScaSRS
based on the random sort algorithm
using probabilistic thresholds to decide on the fly whether to select,
reject, or wait-list an item independently of others
embarrassingly parallel
high success rate and O(√s) storage
streaming environments
extension to stratified sampling
straightforward MapReduce implementation
38. Outline
1 Simple random sampling without replacement
Existing algorithms
Algorithm ScaSRS
Streaming environments
Stratified sampling
Empirical evaluation
2 Simple random sampling with replacement
Existing algorithms
Algorithm ScaSRSWR
Streaming environments
40. Simple random sampling with replacement (SRSWR)
A simple random sample with replacement (SRSWR) of size s from a
population of n items can be thought of as drawing s independent samples
of size 1, where each of the s items in the sample is selected from the
population with equal probability.
An item may appear more than once in the sample.
Equivalent to sampling from Multinomial(s, 1/n, 1/n, . . . , 1/n).
Applications:
bootstrapping
ensemble methods
generating random tuples
41. The draw-by-draw method
Draw-by-draw
1: Set S = ∅.
2: for i from 1 to s do
3: Select one item t with equal probability from T.
4: Let S := S + {t}.
5: end for
6: Return S.
Selecting one item with equal probability is hard due to
variable-length records,
no indices.
43. The Poisson-approximation algorithm
Poisson-approximation (Laserson, 2013)
1: for each item ti ∈ T do
2: Generate a number si from distribution Pois(p).
3: if si > 0 then
4: Repeat ti for si times in the sample.
5: end if
6: end for
Pros:
One-pass.
O(1) storage.
Embarrassingly parallel.
Cons:
Variable sample size.
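A Python sketch of the Poisson-approximation pass; `draw_poisson` is a naive Knuth-style sampler standing in for a library routine, and both names are mine:

```python
import math
import random

def draw_poisson(lam):
    """Knuth's product-of-uniforms Poisson draw (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > limit:
        prod *= random.random()
        k += 1
    return k

def poisson_approx_sample(items, p):
    """Laserson (2013): each item appears Pois(p) times, so the total
    sample size is random, distributed Pois(p * n)."""
    sample = []
    for t in items:
        si = draw_poisson(p)
        if si > 0:
            sample.extend([t] * si)
    return sample
```

Each item is handled independently, so the pass is embarrassingly parallel; the price is that the sample size fluctuates around pn.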
45. The Poisson-approximation algorithm (cont.)
If Yi ∼ Pois(p), i = 1, . . . , n, are independent, then given
Σi Yi = s, (Y1, Y2, . . . , Yn) follows Multinomial(s, 1/n, 1/n, . . . , 1/n).
If the sample from the Poisson-approximation algorithm happens to
have size s = pn, it is a simple random sample with replacement.
If Xi ∼ Pois(λi), i = 1, . . . , n, are independent and λ = Σi λi, we
have Y = Σi Xi ∼ Pois(λ).
The size of the sample from the Poisson-approximation algorithm
follows distribution Pois(pn).
48. How to obtain the exact sample size?
To get the exact sample size, we follow an approach similar to what we
have in ScaSRS.
Generate a Poisson sequence to pre-accept items on the fly, where
each value follows Pois(p1) independently for some p1 < p.
Generate another Poisson sequence to wait-list items, where each
value follows Pois(p2) independently for some p2 > 0.
Let a be the number of items we pre-accepted. Select a simple
random sample without replacement of size s − a from the waiting list
and merge it into the final sample.
49. ScaSRSWR: a scalable algorithm for SRSWR
1: Choose δ ∈ (0, 1), p1 < p such that F_Pois(p1·n)(s) ≥ 1 − δ, and p2 > 0
such that F_Pois((p1+p2)·n)(s) ≤ δ.
2: function map(ti )
3: Generate a number s1i from distribution Pois(p1).
4: Include ti for s1i times in the sample.
5: Generate a number s2i from distribution Pois(p2).
6: for j ∈ {1, . . . , s2i } do
7: Draw a value u from U(0, 1) and emit (u, ti ).
8: end for
9: end function
10: function reduce(. . . , (uk, tk), . . .)
11: Let a be the number of accepted items in step 4.
12: Select the first s − a items ordered by key.
13: end function
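A single-process sketch of ScaSRSWR (helper names mine). Instead of the CDF conditions in step 1, it uses the concrete rates p1 = p − 5√(p/n) and p2 = 10√(p/n) given later in the talk, which satisfy those conditions when pn > 100:

```python
import math
import random

def draw_poisson(lam):
    """Naive Knuth Poisson draw (fine for moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > limit:
        prod *= random.random()
        k += 1
    return k

def scasrswr(items, p):
    """Sketch of ScaSRSWR: pre-accept copies with Pois(p1), wait-list
    copies with Pois(p2), then top up from the sorted waiting list."""
    n = len(items)
    s = int(p * n)
    p1 = p - 5.0 * math.sqrt(p / n)   # pre-accept rate: Pois(p1*n) <= s w.h.p.
    p2 = 10.0 * math.sqrt(p / n)      # wait-list rate
    assert p1 > 0, "needs p*n large enough (n > 100/p)"
    sample, waiting = [], []
    for t in items:
        sample.extend([t] * draw_poisson(p1))      # pre-accepted copies
        for _ in range(draw_poisson(p2)):          # wait-listed copies
            waiting.append((random.random(), t))
    if len(sample) > s or len(sample) + len(waiting) < s:
        raise RuntimeError("ScaSRSWR failed (probability <= 2*delta)")
    waiting.sort()
    sample.extend(t for _, t in waiting[: s - len(sample)])
    return sample
```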
50. ScaSRSWR: a scalable algorithm for SRSWR (cont.)
Theorem
For a fixed δ > 0, ScaSRSWR outputs a simple random sample with
replacement of size s with probability at least 1 − 2δ. Moreover, with high
probability, it only needs O(√s) storage and runs in O(n) time.
The algorithm fails to output a sample of size s if
it pre-accepted too many items, i.e., a > s,
or it wait-listed too few items, i.e., a + w < s, where w is the size of
the waiting list.
So the overall failure rate is at most 2δ. Given a ≤ s and a + w ≥ s, we
can prove that the output is a simple random sample with replacement.
53. Streaming environments
It is possible to extend ScaSRSWR to a streaming environment with some
tweaks. Assuming that only p is given, we need three Poisson sequences:
1 The first sequence generates pre-accepted items, where
X1i ∼ Pois(p1i ), which means including ti for X1i times.
2 The second sequence gets “merged” with the first sequence at the
end of the stream such that each element in the merged sequence
follows Pois(p1), which is the same as the non-streaming case.
3 The third sequence is collected at the end of the stream and
transformed such that each element in the transformed sequence
follows Pois(p2), which is the same as the non-streaming case.
55. How to merge two Poisson sequences?
If Xi ∼ Pois(λi) are independent and λ = Σi λi, then Y = Σi Xi ∼ Pois(λ).
Suppose the ith item in the first sequence follows Pois(p1i).
We need the ith item in the second sequence to follow Pois(p1 − p1i) in
order for the sum to follow Pois(p1). However, p1 depends on n,
which is unknown until we reach the end of the stream.
If X ∼ Pois(λ) and Y ∼ Binom(X, p), then Y ∼ Pois(λp).
We can transform the second sequence at the end of the stream to
make each item in the merged sequence follow Pois(p1), as long as
p1i ≤ p1 and p1i + p^c_1i ≥ p1, i = 1, . . . , n:
If X1i ∼ Pois(p1i), X^c_1i ∼ Pois(p − p1i), and
Y^c_1i ∼ Binom(X^c_1i, (p1 − p1i)/(p − p1i)),
then X1i + Y^c_1i ∼ Pois(p1).
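The thinning fact used above (Binomial sub-sampling of a Poisson count is again Poisson) can be checked empirically; `draw_poisson` and `draw_binom` are naive stand-ins for library samplers:

```python
import math
import random

def draw_poisson(lam):
    """Naive Knuth Poisson draw."""
    limit = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > limit:
        prod *= random.random()
        k += 1
    return k

def draw_binom(n, q):
    """Sum of n Bernoulli(q) trials."""
    return sum(random.random() < q for _ in range(n))

# Thinning: X ~ Pois(lam), Y | X ~ Binom(X, q)  =>  Y ~ Pois(lam * q).
# Empirically, the mean of Y should be close to lam * q = 10 * 0.3 = 3.
trials = 20000
total = sum(draw_binom(draw_poisson(10.0), 0.3) for _ in range(trials))
mean = total / trials
```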
56. Streaming environments
For any λ > 100, it is easy to verify the following bounds:
Pr{X ≤ λ} > 0.99995, X ∼ Pois(λ − 5√λ),
Pr{X < λ} < 1 − 0.99995, X ∼ Pois(λ + 5√λ).
Assuming that n > n0 > 100/p, we are going to choose p1 = p − 5√(p/n)
and p2 = 10√(p/n). For each i ∈ {1, . . . , n}, we set
p1i = p − 5√(p/max(i, n0)),
p^c_1i = p − p1i = 5√(p/max(i, n0)),
p2i = 10√(p/max(i, n0)).
The algorithm succeeds with probability at least 0.9999. The expected
storage is
Σi (p^c_1i + p2i) ≤ Σi 15√(p/i) = O(√(pn)) = O(√s).
59. Streaming ScaSRSWR (when only p is given)
1: function map(ti)
2: Let p1i = p − 5√(p/max(i, n0)) and generate s1i ∼ Pois(p1i).
3: Include ti for s1i times in the sample.
4: Let p^c_1i = p − p1i and generate s^c_1i ∼ Pois(p^c_1i).
5: Emit (0, (p1i, s^c_1i, ti)) if s^c_1i > 0.
6: Let p2i = 10√(p/max(i, n0)) and generate s2i ∼ Pois(p2i).
7: Emit (1, (p2i, s2i, ti)) if s2i > 0.
8: end function
9: Compute p1 = p − 5√(p/n) and p2 = 10√(p/n).
10: function reduce(0, [. . . , (p1k, s^c_1k, tk), . . .])
11: Generate s̄^c_1k ∼ Binom(s^c_1k, (p1 − p1k)/(p − p1k)) and include tk
for s̄^c_1k times.
12: end function
13: Let a be the number of accepted items and W = ∅.
14: function reduce(1, [. . . , (p2k, s2k, tk), . . .])
15: Generate s̄2k ∼ Binom(s2k, p2/p2k) and add tk to W for s̄2k times.
16: end function
17: Select a simple random sample of size s − a from W and output it.
60. Streaming ScaSRSWR (when only s is given)
When only s is given, we can no longer accept items on the fly because
the sampling probability could be arbitrarily small. However, we can still
generate a Poisson sequence as the waiting list W :
Choose Xi ∼ Pois((s + 5√s)/max(i, n0)).
At the end of the stream, we can adjust the sequence using Binomial
values to make the Poisson numbers i.i.d.
The sum of the adjusted sequence is greater than s with high probability.
Storage: O(Σi (s + 5√s)/max(i, n0)) = O(s log n).
After adjustment, generate a simple random sample without replacement
of size s from W and output it as the final sample.
61. Summary
Scalable simple random sampling algorithms — ScaSRS and ScaSRSWR:
independently select, reject, or wait-list an item on the fly
embarrassingly parallel
high success rate and O(√s) storage
streaming environments
extension to stratified sampling
open-source implementations:
Apache Spark: https://github.com/mengxr/spark-sampling/
Apache DataFu: http://datafu.incubator.apache.org/