Scalable Simple Random Sampling Algorithms
Xiangrui Meng
Joint ICME/Statistics Seminar in Data Science
Xiangrui Meng (Databricks) ScaSRS March 3, 2014 1 / 40
Spark workshop (April 4, 2014)
http://icme.stanford.edu/news/2014/spark-workshop
Reza Zadeh (rezab@stanford.edu)
Apache Spark is a fast and general engine for large-scale data processing.
Statistical analysis of big data
Analyzing data sets of billions of records has now become a regular task in
many companies and institutions. The continuous increase in data size
keeps challenging the design of algorithms.
Design and implement new scalable algorithms.
Algorithms:
alternating direction method of multipliers (Boyd et al., 2011)
matrix factorization for recommender systems (Koren et al., 2009)
Libraries:
Vowpal Wabbit, Apache Mahout, H2O, and Spark MLlib
Reduce the data size and use traditional algorithms.
Sampling is a systematic and cost-effective way, sometimes with
provable performance:
Coresets for k-means and k-median clustering (Har-Peled et al., 2004).
Coresets for ℓ1, ℓ2, and ℓp regression (Clarkson, 2005; Drineas et al.,
2006; Dasgupta et al., 2009; ...)
However, even the sampling algorithms do not always scale well ...
Outline
1 Simple random sampling without replacement
Existing algorithms
Algorithm ScaSRS
Streaming environments
Stratified sampling
Empirical evaluation
2 Simple random sampling with replacement
Existing algorithms
Algorithm ScaSRSWR
Streaming environments
Simple random sampling (SRS)
Simple random sampling (Thompson, 2012)
Simple random sampling is a sampling design in which s distinct items are
selected from the n items in the population in such a way that every
possible combination of s items is equally likely to be the sample selected.
SRS is often used as
a sampling technique,
a building block for complex sampling methods.
Given an item set T, which contains n items: t1, . . . , tn, and an integer
s ≤ n, we want to generate a simple random sample of size s from T.
The draw-by-draw method
Draw-by-draw
1: Set S = ∅.
2: for i from 1 to s do
3: Select one item t with equal probability from T − S.
4: Let S := S + {t}.
5: end for
6: Return S.
Selecting one item with equal probability is hard due to
variable-length records,
no indices.
Representing T − S is also hard when data is large.
The selection-rejection algorithm
Selection-rejection (Fan, 1962)
1: Set i = 0.
2: for j from 1 to n do
3: With probability (s − i)/(n − j + 1), select tj and let i = i + 1.
4: end for
Pros:
One-pass.
O(1) storage.
Cons:
Sequential.
Needs both n and s to work.
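The selection-rejection pass can be sketched in a few lines of Python. This is only an illustration of the pseudocode above, not the original implementation; the function name `selection_rejection` and the use of `randrange` to realize the probability (s − i)/(n − j + 1) exactly are assumptions made here.

```python
import random


def selection_rejection(items, s, rng=None):
    """One sequential pass over items; both n = len(items) and s must be known."""
    if rng is None:
        rng = random.Random()
    n = len(items)
    if not 0 <= s <= n:
        raise ValueError("need 0 <= s <= n")
    sample = []
    for j, item in enumerate(items, start=1):
        remaining = n - j + 1          # items not yet examined, including t_j
        needed = s - len(sample)       # s - i in the slide's notation
        # randrange(remaining) < needed happens with probability needed / remaining.
        if rng.randrange(remaining) < needed:
            sample.append(item)
    return sample
```

The sample always has exactly s items, because the selection probability becomes 1 once the remaining items are all needed and 0 once s items have been selected. The loop is inherently sequential, since each decision depends on the running count.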
The reservoir algorithm
Reservoir (Vitter, 1985)
1: The first s items are stored into a reservoir R.
2: for i from s + 1 to n do
3: With probability s/i, remove an item chosen with equal probability from R
and let ti take its place.
4: end for
5: Select the items in R.
Pros:
Does not require n.
Cons:
Sequential.
O(s) storage.
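A minimal Python sketch of the reservoir scheme described above, assuming the input is any iterable whose length is not known in advance. The function name `reservoir_sample` is chosen here for illustration.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Keep a uniform sample of size s from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)              # the first s items fill the reservoir
        elif rng.randrange(i) < s:              # happens with probability s / i
            reservoir[rng.randrange(s)] = item  # evict a uniformly chosen slot
    return reservoir
```

If the stream holds fewer than s items, the sketch simply returns all of them. The O(s) storage is the reservoir itself.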
The random sort algorithm
Random sort (Sunter, 1977)
1: Associate each ti with an independent key ui drawn from U(0, 1).
2: Sort T in ascending order with regard to the key.
3: Select the smallest s items.
Cons:
A random permutation of the entire data set.
Pros:
The process of generating {ui} is embarrassingly parallel.
Sorting is scalable.
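For reference, the random sort algorithm on an in-memory list might look as follows. This is a sketch under the assumption that the data fits in memory, which is exactly what does not hold at scale; the name `random_sort_sample` is illustrative.

```python
import random


def random_sort_sample(items, s, rng=None):
    """Attach a U(0, 1) key to every item, sort by key, keep the s smallest."""
    if rng is None:
        rng = random.Random()
    keyed = [(rng.random(), index) for index in range(len(items))]
    keyed.sort()                                  # full sort of all n keys
    return [items[index] for _, index in keyed[:s]]
```

The sort touches all n keys even though only s of them matter, which is the cost the later slides aim to avoid.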
An example of random sort
Set n = 100, s = 10, and hence the sampling probability p = s/n = 0.1.
1 Generate random keys:
(0.644, t1), (0.378, t2), . . . , (0.587, t10), . . . , (0.500, t99), (0.471, t100)
2 Sort and select the smallest 10 items:
(0.028, t94), (0.029, t44), . . . , (0.137, t69) [the smallest 10 items], . . . , (0.980, t26), (0.988, t60)
Fact: the 10th item after the sort is associated with a random key 0.137.
Heuristics
Qualitatively speaking,
if ui is “much larger” than p, then ti is “very unlikely” to be selected;
if ui is “much smaller” than p, then ti is “very likely” to be selected.
Set two thresholds q1 and q2, such that
if ui > q1, reject ti directly;
if ui < q2, select ti directly;
otherwise, put ti onto a waiting list that goes to the sort phase.
The resulting algorithm fails
if we reject more than n − s items,
or if we select more than s items.
Otherwise, it returns the same result as the random sort algorithm.
ScaSRS: a scalable simple random sampling algorithm
1: Let l = 0, and W = ∅ be the waiting list.
2: for each item ti ∈ T do
3: Draw a key ui independently from U(0, 1).
4: if ui < q2 then
5: Select ti and let l := l + 1.
6: else if ui < q1 then
7: Associate ti with ui and add it into W .
8: end if
9: end for
10: Sort W ’s items in the ascending order of the key.
11: Select the smallest pn − l items from W .
ScaSRS outputs the same result as the random sort algorithm given
the same sequence of random keys, if it succeeds.
ScaSRS is embarrassingly parallel.
A quantitative analysis
Theorem
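For a fixed δ > 0, if we choose
$$q_1 = \min\left(1,\; p + \gamma_1 + \sqrt{\gamma_1^2 + 2\gamma_1 p}\right), \quad \text{where } \gamma_1 = -\frac{\log \delta}{n},$$
$$q_2 = \max\left(0,\; p + \gamma_2 - \sqrt{\gamma_2^2 + 3\gamma_2 p}\right), \quad \text{where } \gamma_2 = -\frac{2 \log \delta}{3n},$$
ScaSRS succeeds with probability at least 1 − 2δ. Moreover, with high
probability, it only needs O(√s) storage and runs in O(n) time.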
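Putting the algorithm and the theorem's thresholds together, a single-process Python sketch could look like this. It assumes the square-root form of the thresholds, q1 = min(1, p + γ1 + sqrt(γ1² + 2γ1p)) and q2 = max(0, p + γ2 − sqrt(γ2² + 3γ2p)); the names `scasrs_thresholds` and `scasrs`, and the choice to raise an exception on failure, are illustrative and not taken from the talk.

```python
import math
import random


def scasrs_thresholds(n, s, delta=5e-5):
    """Thresholds (q1, q2) from the theorem, with p = s / n."""
    p = s / n
    g1 = -math.log(delta) / n
    g2 = -2.0 * math.log(delta) / (3.0 * n)
    q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    return q1, q2


def scasrs(items, s, delta=5e-5, rng=None):
    """Return (sample, waiting_list_size); raise if the thresholds fail."""
    if rng is None:
        rng = random.Random()
    n = len(items)
    if n == 0 or not 0 <= s <= n:
        raise ValueError("need n > 0 and 0 <= s <= n")
    q1, q2 = scasrs_thresholds(n, s, delta)
    accepted, waiting = [], []
    for item in items:
        u = rng.random()
        if u < q2:
            accepted.append(item)        # selected on the fly
        elif u < q1:
            waiting.append((u, item))    # decided after the scan
        # otherwise the item is rejected on the fly
    need = s - len(accepted)
    if need < 0 or need > len(waiting):
        raise RuntimeError("ScaSRS failed; rerun with fresh keys")
    waiting.sort(key=lambda pair: pair[0])
    sample = accepted + [item for _, item in waiting[:need]]
    return sample, len(waiting)
```

Only the waiting list is sorted. For n = 100000 and s = 5000 its expected size is a few hundred items, so the sort is cheap compared with sorting all n keys.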
A practical choice of δ
Set δ = 0.00005. We get the following thresholds:
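$$q_1 = \min\left(1,\; p + \frac{10}{n} + \sqrt{\frac{100}{n^2} + \frac{20p}{n}}\right),$$
$$q_2 = \max\left(0,\; p + \frac{20}{3n} - \sqrt{\frac{400}{9n^2} + \frac{20p}{n}}\right),$$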
and ScaSRS succeeds with probability at least 1 − 2δ = 99.99%.
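The constants 10 and 20/3 in these thresholds appear to come from rounding −log δ ≈ 9.9 up to 10. The short check below is a sketch under that reading. The helper `thresholds` and the values n = 10**6, p = 0.05 are chosen here purely for illustration.

```python
import math


def thresholds(n, p, c):
    """q1 and q2 from the theorem, with c standing in for -log(delta)."""
    g1 = c / n
    g2 = 2.0 * c / (3.0 * n)
    q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    return q1, q2


delta = 0.00005
exact_c = -math.log(delta)           # roughly 9.9
n, p = 10**6, 0.05                   # illustrative values
print(exact_c)
print(thresholds(n, p, exact_c))     # thresholds with the exact constant
print(thresholds(n, p, 10.0))        # thresholds with the rounded constant 10
```

Rounding the constant up widens the gap between q2 and q1 slightly, so the rounded thresholds are no less conservative than the exact ones.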
Sketch of proof
Denote by $U_i$ the random key associated with item $i$ and let $Y_i = 1_{U_i < q_1}$.
$E[Y_i] = q_1$ and $E[Y_i^2] = q_1$.
$Y = \sum_i Y_i$ is the number of un-rejected items during the scan.
Apply a Bernstein-type inequality (Maurer, 2003),
$$\log \Pr\{Y \le pn\} \le -\frac{(q_1 - p)^2 n}{2 q_1}.$$
With the choice of $q_1$ in our theorem, we have
$$\Pr\{Y \le s\} \le \delta.$$
By similar arguments, we can bound the number of selected items
during the scan:
$$\Pr\Big\{\sum_i 1_{U_i < q_2} \ge s\Big\} \le \delta.$$
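The step from the Bernstein-type bound to the failure probability δ is short algebra. The derivation below is a sketch that assumes q1 < 1, so that the min in the definition of q1 is inactive, and uses s = pn.

```latex
\begin{align*}
x &:= q_1 - p = \gamma_1 + \sqrt{\gamma_1^2 + 2\gamma_1 p}, \\
(x - \gamma_1)^2 &= \gamma_1^2 + 2\gamma_1 p
  \;\Longrightarrow\; x^2 = 2\gamma_1 (x + p) = 2\gamma_1 q_1, \\
\frac{(q_1 - p)^2\, n}{2 q_1} &= \gamma_1 n = -\log \delta
  \;\Longrightarrow\; \log \Pr\{Y \le pn\} \le \log \delta .
\end{align*}
```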
The size of the waiting list = O(√s), w.h.p.
[Figure: pdf of the number of unrejected items and pdf of the number of accepted items, for n = 1e6, k = 50000, p = 0.05; horizontal axis from 48000 to 52000; the gap between the two distributions is marked O(sqrt(k)).]
Streaming environments (when only p is given)
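If only p is given, we can update the thresholds q1 and q2 on the fly based
on the number of items seen so far, denoted by i:
$$q_{1,i} = \min\left(1,\; p + \gamma_{1,i} + \sqrt{\gamma_{1,i}^2 + 2\gamma_{1,i}\, p}\right), \quad \text{where } \gamma_{1,i} = -\frac{\log \delta}{i},$$
$$q_{2,i} = \max\left(0,\; p + \gamma_{2,i} - \sqrt{\gamma_{2,i}^2 + 3\gamma_{2,i}\, p}\right), \quad \text{where } \gamma_{2,i} = -\frac{2 \log \delta}{3i}.$$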
O(log n + √ s log n) storage
at least 1 − 2δ success rate
not necessary to know the exact s but just a good lower bound
use the local count on each process
or a global count updated less frequently
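The streaming variant with only p given can be sketched as below. Two points are assumptions made here rather than details from the slides: the target size is taken to be round(p·n) once the stream ends, and the final selection sorts the waiting list exactly as in the non-streaming case. The names `running_thresholds` and `streaming_scasrs` are illustrative.

```python
import math
import random


def running_thresholds(i, p, delta=5e-5):
    """Thresholds (q1_i, q2_i) after i items have been seen (i >= 1)."""
    c = -math.log(delta)
    g1 = c / i
    g2 = 2.0 * c / (3.0 * i)
    q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    return q1, q2


def streaming_scasrs(stream, p, delta=5e-5, rng=None):
    """Return (sample, waiting_list_size) for a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    accepted, waiting = [], []
    n = 0
    for n, item in enumerate(stream, start=1):
        q1, q2 = running_thresholds(n, p, delta)
        u = rng.random()
        if u < q2:
            accepted.append(item)
        elif u < q1:
            waiting.append((u, item))
    s = round(p * n)                     # assumed target size once n is known
    need = s - len(accepted)
    if need < 0 or need > len(waiting):
        raise RuntimeError("streaming ScaSRS failed; rerun with fresh keys")
    waiting.sort(key=lambda pair: pair[0])
    return accepted + [item for _, item in waiting[:need]], len(waiting)
```

Early in the stream the thresholds are loose (q1 is 1 and q2 is 0 for the first few items), so almost everything is wait-listed. They tighten toward p as i grows, which is why the waiting list is larger than in the case where n is known.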
Streaming environments (when only s is given)
When only s is given, we can no longer accept items on the fly because
the sampling probability could be arbitrarily small. However, we can still
reject items on the fly based on s and i:
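$$q_{1,i} = \min\left(1,\; \frac{s}{i} + \gamma_{1,i} + \sqrt{\gamma_{1,i}^2 + 2\gamma_{1,i}\,\frac{s}{i}}\right), \quad \text{where } \gamma_{1,i} = -\frac{\log \delta}{i}.$$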
s(log n + 1) + O(√ s + log n) storage
at least 1 − δ success rate
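When only s is given, every item is either rejected on the fly or wait-listed. The sketch below assumes the rejection threshold q1,i = min(1, s/i + γ1,i + sqrt(γ1,i² + 2γ1,i·s/i)) and keeps the s smallest keys of the waiting list at the end. The function name `streaming_scasrs_fixed_size` is illustrative.

```python
import math
import random


def streaming_scasrs_fixed_size(stream, s, delta=5e-5, rng=None):
    """Return (sample, waiting_list_size) when only the sample size s is known."""
    if rng is None:
        rng = random.Random()
    c = -math.log(delta)
    waiting = []
    for i, item in enumerate(stream, start=1):
        ratio = s / i                    # plays the role of p after i items
        g1 = c / i
        q1 = min(1.0, ratio + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * ratio))
        u = rng.random()
        if u < q1:
            waiting.append((u, item))    # cannot be accepted yet, only kept
        # otherwise the item is rejected on the fly
    if len(waiting) < s:
        raise RuntimeError("too many items rejected; rerun with fresh keys")
    waiting.sort(key=lambda pair: pair[0])
    return [item for _, item in waiting[:s]], len(waiting)
```

The first s items are always kept because s/i ≥ 1 there. After that the threshold decays like 1/i, which is where the s·log n term in the storage bound comes from.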
Stratified sampling
If the item set is heterogeneous, it may be possible to partition it into
several non-overlapping homogeneous subsets, called strata. Applying SRS
within each stratum is preferred to applying SRS to the entire set for
better representativeness. This approach is called stratified sampling.
Applications:
U.S. Census survey
political survey
Stratification:
based on training labels
based on days of the week
Stratified sampling (cont.)
Applying ScaSRS to stratified sampling is straightforward. Let m be the
number of strata. We have the following result:
If the size of each stratum is given, we need O(√(ms)) storage.
If only the sampling probability p is given, we need
O(m log n + √ ms log n) storage.
If only the sample size s is given, we need
s(log n + 1) + O(√ sm log n) storage.
Empirical evaluation: MapReduce implementation
1: Set l = 0.
2: function map(ti )
3: Generate ui from U(0, 1).
4: if ui < q2 then
5: Select (and output directly) ti .
6: l := l + 1.
7: else if ui < q1 then
8: Emit (ui , ti ).
9: end if
10: end function
11: function reduce(. . . , (ui , ti ), . . .)
12: Select the first pn − l items.
13: end function
Empirical evaluation: simple random sampling
                      P1      P2      P3      P4      P5      P6
n                     6.0e7   6.0e7   3.0e8   3.0e8   1.5e9   1.5e9
p                     0.01    0.1     0.01    0.1     0.01    0.1
s                     6.0e5   6.0e6   3.0e6   3.0e7   1.5e7   1.5e8
Selection-Rejection   281     355     1371    1475    >3600   >3600
Reservoir             288     299     1285    1571    >3600   >3600
Random Sort           513     581     1629    2344    >3600   >3600
ScaSRS                96      103     126     127     140     158
ScaSRSp               98      114     109     139     162     214
W                     6.9e3   2.2e4   1.6e4   4.9e4   3.4e4   1.1e5
Wp                    5.8e4   1.8e5   2.9e5   9.1e5   1.5e6   4.5e6
Table: Test problems, running times (in seconds), and waiting list sizes.
Empirical evaluation: stratified sampling
Problem and setup:
23.25 billion page-view events, 7 terabytes, 8 strata.
The ratio between the size of the largest stratum and that of the smallest
stratum is approximately 15000.
p = 0.01.
3000 mappers and 5 reducers.
Result:
509 seconds.
Within the waiting list, the ratio between the size of the largest stratum
and that of the smallest stratum is 861.2.
Summary for ScaSRS
based on the random sort algorithm
using probabilistic thresholds to decide on the fly whether to select,
reject, or wait-list an item independently of others
embarrassingly parallel
high success rate and O(√s) storage
streaming environments
extension to stratified sampling
straightforward MapReduce implementation
Simple random sampling with replacement (SRSWR)
A simple random sample with replacement (SRSWR) of size s from a
population of n items can be thought of as drawing s independent samples
of size 1, where each of the s items in the sample is selected from the
population with equal probability.
An item may appear more than once in the sample.
Equivalent to sampling from
$$\mathrm{Multinomial}\left(s;\ \tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n}\right).$$
Applications:
bootstrapping
ensemble methods
generating random tuples
The draw-by-draw method
Draw-by-draw
1: Set S = ∅.
2: for i from 1 to s do
3: Select one item t with equal probability from T.
4: Let S := S + {t}.
5: end for
6: Return S.
Selecting one item with equal probability is hard due to
variable-length records,
no indices.
The Poisson-approximation algorithm
Poisson-approximation (Laserson, 2013)
1: for each item ti ∈ T do
2: Generate a number si from distribution Pois(p).
3: if si > 0 then
4: Repeat ti for si times in the sample.
5: end if
6: end for
Pros:
One-pass.
O(1) storage.
Embarrassingly parallel.
Cons:
Variable sample size.
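A small Python sketch of the Poisson-approximation pass. The Poisson sampler `poisson` uses Knuth's multiplicative method, which is a choice made here for self-containment (it is adequate for small rates such as a sampling probability) and not something the slides specify. The name `poisson_approx_sample` is likewise illustrative.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Pois(lam) by Knuth's multiplicative method (small lam only)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_approx_sample(items, p, rng=None):
    """Repeat each item Pois(p) times; the total size is random, Pois(p * n)."""
    if rng is None:
        rng = random.Random()
    sample = []
    for item in items:
        sample.extend([item] * poisson(p, rng))
    return sample
```

Each item is handled independently, so the loop parallelizes trivially. The returned size fluctuates around p·n from run to run.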
The Poisson-approximation algorithm (cont.)
If $Y_i \sim \mathrm{Pois}(p)$, $i = 1, \ldots, n$, are independent, then given
$\sum_{i=1}^n Y_i = s$, $(Y_1, Y_2, \ldots, Y_n)$ follows $\mathrm{Multinom}\left(s;\ \tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)$.
If the sample from the Poisson-approximation algorithm happens to
have size s = pn, it is a simple random sample with replacement.
If $X_i \sim \mathrm{Pois}(\lambda_i)$, $i = 1, \ldots, n$, are independent and $\lambda = \sum_{i=1}^n \lambda_i$, we
have $Y = \sum_{i=1}^n X_i \sim \mathrm{Pois}(\lambda)$.
The size of the sample from the Poisson-approximation algorithm
follows distribution Pois(pn).
How to obtain the exact sample size?
To get the exact sample size, we follow an approach similar to what we
have in ScaSRS.
Generate a Poisson sequence to pre-accept items on the fly, where
each value follows Pois(p1) independently for some p1 < p.
Generate another Poisson sequence to wait-list items, where each
value follows Pois(p2) independently for some p2 > 0.
Let a be the number of items we pre-accepted. Select a simple
random sample without replacement of size s − a from the waiting list
and merge it into the final sample.
ScaSRSWR: a scalable algorithm for SRSWR
1: Choose δ ∈ (0, 1), p1 < p such that $F_{\mathrm{Pois}(p_1 n)}(s) \ge 1 - \delta$, and p2 > 0
such that $F_{\mathrm{Pois}((p_1 + p_2) n)}(s) \le \delta$.
2: function map(ti )
3: Generate a number s1i from distribution Pois(p1).
4: Include ti for s1i times in the sample.
5: Generate a number s2i from distribution Pois(p2).
6: for j ∈ {1, . . . , s2i } do
7: Draw a value u from U(0, 1) and emit (u, ti ).
8: end for
9: end function
10: function reduce(. . . , (uk, tk), . . .)
11: Let a be the number of accepted items in step 4.
12: Select the first s − a items ordered by key.
13: end function
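A single-process sketch of ScaSRSWR follows. Instead of solving the two CDF conditions in step 1, it plugs in the concrete choice p1 = p − 5·sqrt(p/n) and p2 = 10·sqrt(p/n) that the streaming slides use for pn > 100. That substitution, the Knuth-style Poisson sampler, and the function names are assumptions made for illustration.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Pois(lam) by Knuth's multiplicative method (small lam only)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def scasrswr(items, s, rng=None):
    """Return a with-replacement sample of exactly s items, or raise on failure."""
    if rng is None:
        rng = random.Random()
    n = len(items)
    if n == 0 or s <= 100:
        raise ValueError("this sketch assumes n > 0 and s = p * n > 100")
    p = s / n
    p1 = p - 5.0 * math.sqrt(p / n)      # pre-accept rate, slightly below p
    p2 = 10.0 * math.sqrt(p / n)         # wait-list rate
    accepted, waiting = [], []
    for item in items:
        accepted.extend([item] * poisson(p1, rng))
        for _ in range(poisson(p2, rng)):
            waiting.append((rng.random(), item))
    need = s - len(accepted)             # s - a in the slide's notation
    if need < 0 or need > len(waiting):
        raise RuntimeError("ScaSRSWR failed; rerun with fresh random numbers")
    waiting.sort(key=lambda pair: pair[0])
    return accepted + [item for _, item in waiting[:need]]
```

Taking the entries with the smallest uniform keys is one way to draw the simple random sample without replacement of size s − a from the waiting list.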
ScaSRSWR: a scalable algorithm for SRSWR (cont.)
Theorem
For a fixed δ > 0, ScaSRSWR outputs a simple random sample with
replacement of size s with probability at least 1 − 2δ. Moreover, with high
probability, it only needs O(√s) storage and runs in O(n) time.
The algorithm fails to output a sample of size s if
it pre-accepted too many items, i.e., a > s,
or it wait-listed too few items, i.e., a + w < s, where w is the size of
the waiting list.
So the overall failure rate is at most 2δ. Given a ≤ s and a + w ≥ s, we
can prove that the output is a simple random sample with replacement.
Streaming environments
It is possible to extend ScaSRSWR to a streaming environment with some
tweaks. Assuming that only p is given, we need three Poisson sequences:
1 The first sequence generates pre-accepted items, where
X1i ∼ Pois(p1i ), which means including ti for X1i times.
2 The second sequence gets “merged” with the first sequence at the
end of the stream such that each element in the merged sequence
follows Pois(p1), which is the same as the non-streaming case.
3 The third sequence is collected at the end of the stream and
transformed such that each element in the transformed sequence
follows Pois(p2), which is the same as the non-streaming case.
How to merge two Poisson sequences?
If $X_i \sim \mathrm{Pois}(\lambda_i)$ are independent and $\lambda = \sum_i \lambda_i$, then $Y = \sum_i X_i \sim \mathrm{Pois}(\lambda)$.
Suppose the ith item in the first sequence follows Pois(p1i).
We need the ith item in the second sequence to follow Pois(p1 − p1i) so
that the sum follows Pois(p1). However, p1 depends on n,
which is unknown until we reach the end of the stream.
If X ∼ Pois(λ) and Y ∼ Binom(X, p), then Y ∼ Pois(λp).
We can transform the second sequence at the end of the stream so that
each item in the merged sequence follows Pois(p1), as long as
$p_{1i} \le p_1$ and $p_{1i} + p_{1i}^c \ge p_1$, $i = 1, \ldots, n$:
If $X_{1i} \sim \mathrm{Pois}(p_{1i})$, $X_{1i}^c \sim \mathrm{Pois}(p - p_{1i})$, and $Y_{1i}^c \sim \mathrm{Binom}\left(X_{1i}^c, \frac{p_1 - p_{1i}}{p - p_{1i}}\right)$,
then $X_{1i} + Y_{1i}^c \sim \mathrm{Pois}(p_1)$.
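The merge step can be checked empirically. The sketch below draws the first-sequence count, the complementary count, thins the latter with a Binomial draw, and sums the two. The rates 0.5, 0.2 and 0.4 in the usage lines are arbitrary illustrative values, and the helper names are not from the talk.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Pois(lam) by Knuth's multiplicative method (small lam only)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def merged_count(p, p1i, p1, rng):
    """One draw of X_1i + Y^c_1i; requires p1i <= p1 <= p and p1i < p."""
    x1 = poisson(p1i, rng)                       # pre-accepted copies
    xc = poisson(p - p1i, rng)                   # complementary copies
    keep = (p1 - p1i) / (p - p1i)                # Binomial thinning probability
    yc = sum(1 for _ in range(xc) if rng.random() < keep)
    return x1 + yc


rng = random.Random(11)
draws = [merged_count(0.5, 0.2, 0.4, rng) for _ in range(100000)]
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, variance)   # both should be close to p1 = 0.4 for a Poisson variable
```

A Poisson variable has equal mean and variance, so seeing both near p1 is consistent with the claim, although it is of course not a proof.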
Streaming environments
For any λ > 100, it is easy to verify the following bounds:
$$\Pr\{X \le \lambda\} > 0.99995, \quad X \sim \mathrm{Pois}(\lambda - 5\sqrt{\lambda}),$$
$$\Pr\{X < \lambda\} < 1 - 0.99995, \quad X \sim \mathrm{Pois}(\lambda + 5\sqrt{\lambda}).$$
Assuming that n > n0 > 100/p, we are going to choose $p_1 = p - 5\sqrt{p/n}$
and $p_2 = 10\sqrt{p/n}$. For each i ∈ {1, . . . , n}, we set
$$p_{1i} = p - 5\sqrt{p/\max(i, n_0)},$$
$$p_{1i}^c = p - p_{1i} = 5\sqrt{p/\max(i, n_0)},$$
$$p_{2i} = 10\sqrt{p/\max(i, n_0)}.$$
The algorithm succeeds with probability at least 0.9999. The expected
storage is
$$\sum_{i=1}^n \left(p_{1i}^c + p_{2i}\right) \le \sum_{i=1}^n 15\sqrt{p/i} = O(\sqrt{pn}) = O(\sqrt{s}).$$
Streaming ScaSRSWR (when only p is given)
1: function map(ti)
2: Let $p_{1i} = p - 5\sqrt{p/\max(i, n_0)}$ and generate $s_{1i} \sim \mathrm{Pois}(p_{1i})$.
3: Include ti for $s_{1i}$ times in the sample.
4: Let $p_{1i}^c = p - p_{1i}$ and generate $s_{1i}^c \sim \mathrm{Pois}(p_{1i}^c)$.
5: Emit $(0, (p_{1i}, s_{1i}^c, t_i))$ if $s_{1i}^c > 0$.
6: Let $p_{2i} = 10\sqrt{p/\max(i, n_0)}$ and generate $s_{2i} \sim \mathrm{Pois}(p_{2i})$.
7: Emit $(1, (p_{2i}, s_{2i}, t_i))$ if $s_{2i} > 0$.
8: end function
9: Compute $p_1 = p - 5\sqrt{p/n}$ and $p_2 = 10\sqrt{p/n}$.
10: function reduce(0, [. . . , $(p_{1k}, s_{1k}^c, t_k)$, . . .])
11: Generate $\bar{s}_{1k}^c \sim \mathrm{Binom}\left(s_{1k}^c, \frac{p_1 - p_{1k}}{p - p_{1k}}\right)$ and include tk for $\bar{s}_{1k}^c$ times.
12: end function
13: Let a be the number of accepted items and W = ∅.
14: function reduce(1, [. . . , $(p_{2k}, s_{2k}, t_k)$, . . .])
15: Generate $\bar{s}_{2k} \sim \mathrm{Binom}\left(s_{2k}, \frac{p_2}{p_{2k}}\right)$ and add tk to W for $\bar{s}_{2k}$ times.
16: end function
17: Select a simple random sample of size s − a from W and output it.
Streaming ScaSRSWR (when only s is given)
When only s is given, we can no longer accept items on the fly because
the sampling probability could be arbitrarily small. However, we can still
generate a Poisson sequence as the waiting list W :
Choose $X_i \sim \mathrm{Pois}\left((s + 5\sqrt{s})/\max(i, n_0)\right)$.
At the end of the stream, we can adjust the sequence using Binomial
values to make the Poisson numbers i.i.d.
The sum of the adjusted sequence is greater than s with high probability.
Storage: $O\left(\sum_i (s + 5\sqrt{s})/\max(i, n_0)\right) = O(s \log n)$.
After adjustment, generate a simple random sample without replacement
of size s from W and output it as the final sample.
Summary
Scalable simple random sampling algorithms — ScaSRS and ScaSRSWR:
independently select, reject, or wait-list an item on the fly
embarrassingly parallel
high success rate and O(√s) storage
streaming environments
extension to stratified sampling
open-source implementations:
Apache Spark: https://github.com/mengxr/spark-sampling/
Apache DataFu: http://datafu.incubator.apache.org/
Xiangrui Meng (Databricks) ScaSRS March 3, 2014 40 / 40

More Related Content

What's hot

Sample for Research-Simple Random Sample
Sample for Research-Simple Random SampleSample for Research-Simple Random Sample
Sample for Research-Simple Random SampleDr.Himandra Balalle
 
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...nszakir
 
Chap01 describing data; graphical
Chap01 describing data;  graphicalChap01 describing data;  graphical
Chap01 describing data; graphicalJudianto Nugroho
 
Statistics by DURGESH JHARIYA OF jnv,bn,jbp
Statistics by DURGESH JHARIYA OF jnv,bn,jbpStatistics by DURGESH JHARIYA OF jnv,bn,jbp
Statistics by DURGESH JHARIYA OF jnv,bn,jbpDJJNV
 
Sampling Distribution and Simulation in R
Sampling Distribution and Simulation in RSampling Distribution and Simulation in R
Sampling Distribution and Simulation in RPremier Publishers
 
Chap10 hypothesis testing ; additional topics
Chap10 hypothesis testing ; additional topicsChap10 hypothesis testing ; additional topics
Chap10 hypothesis testing ; additional topicsJudianto Nugroho
 
Sampling and sampling distributions
Sampling and sampling distributionsSampling and sampling distributions
Sampling and sampling distributionsShakeel Nouman
 
Powerpoint sampling distribution
Powerpoint sampling distributionPowerpoint sampling distribution
Powerpoint sampling distributionSusan McCourt
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer Sammer Qader
 
Mann Whitney U Test And Chi Squared
Mann Whitney U Test And Chi SquaredMann Whitney U Test And Chi Squared
Mann Whitney U Test And Chi Squaredguest2137aa
 
Continous random variable.
Continous random variable.Continous random variable.
Continous random variable.Shakeel Nouman
 

What's hot (20)

Sampling distribution
Sampling distributionSampling distribution
Sampling distribution
 
Sample for Research-Simple Random Sample
Sample for Research-Simple Random SampleSample for Research-Simple Random Sample
Sample for Research-Simple Random Sample
 
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...
Chapter 5 part2- Sampling Distributions for Counts and Proportions (Binomial ...
 
Chap01 describing data; graphical
Chap01 describing data;  graphicalChap01 describing data;  graphical
Chap01 describing data; graphical
 
Data sampling and probability
Data sampling and probabilityData sampling and probability
Data sampling and probability
 
Statistics by DURGESH JHARIYA OF jnv,bn,jbp
Statistics by DURGESH JHARIYA OF jnv,bn,jbpStatistics by DURGESH JHARIYA OF jnv,bn,jbp
Statistics by DURGESH JHARIYA OF jnv,bn,jbp
 
Sampling Distribution and Simulation in R
Sampling Distribution and Simulation in RSampling Distribution and Simulation in R
Sampling Distribution and Simulation in R
 
Sampling Distribution
Sampling DistributionSampling Distribution
Sampling Distribution
 
Chap10 hypothesis testing ; additional topics
Chap10 hypothesis testing ; additional topicsChap10 hypothesis testing ; additional topics
Chap10 hypothesis testing ; additional topics
 
Resampling methods
Resampling methodsResampling methods
Resampling methods
 
Sampling methods
Sampling methodsSampling methods
Sampling methods
 
Sampling and sampling distributions
Sampling and sampling distributionsSampling and sampling distributions
Sampling and sampling distributions
 
Powerpoint sampling distribution
Powerpoint sampling distributionPowerpoint sampling distribution
Powerpoint sampling distribution
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer
 
Mann Whitney U Test And Chi Squared
Mann Whitney U Test And Chi SquaredMann Whitney U Test And Chi Squared
Mann Whitney U Test And Chi Squared
 
determinatiion of
determinatiion of determinatiion of
determinatiion of
 
Hypothsis testing
Hypothsis testingHypothsis testing
Hypothsis testing
 
Sampling Theory Part 2
Sampling Theory Part 2Sampling Theory Part 2
Sampling Theory Part 2
 
Chapter 4
Chapter 4Chapter 4
Chapter 4
 
Continous random variable.
Continous random variable.Continous random variable.
Continous random variable.
 

Scalable Simple Random Sampling Algorithms

  • 1. Scalable Simple Random Sampling Algorithms Xiangrui Meng Joint ICME/Statistics Seminar in Data Science Xiangrui Meng (Databricks) ScaSRS March 3, 2014 1 / 40
  • 2. Spark workshop (April 4, 2014)
      http://icme.stanford.edu/news/2014/spark-workshop
      Reza Zadeh (rezab@stanford.edu)
      Apache Spark is a fast and general engine for large-scale data processing.
  • 6. Statistical analysis of big data
      Analyzing data sets of billions of records has become a regular task in many companies and institutions. The continuous increase in data size keeps challenging the design of algorithms.
      Design and implement new scalable algorithms.
        Algorithms:
          alternating direction method of multipliers (Boyd et al., 2011)
          matrix factorization for recommender systems (Koren et al., 2009)
        Libraries:
          Vowpal Wabbit, Apache Mahout, H2O, and Spark MLlib
      Reduce the data size and use traditional algorithms.
        Sampling is a systematic and cost-effective way to do so, sometimes with provable performance:
          coresets for k-means and k-median clustering (Har-Peled et al., 2004)
          coresets for ℓ1, ℓ2, and ℓp regression (Clarkson, 2005; Drineas et al., 2006; Dasgupta et al., 2009; ...)
      However, even the sampling algorithms do not always scale well ...
  • 7. Outline
      1 Simple random sampling without replacement
          Existing algorithms
          Algorithm ScaSRS
          Streaming environments
          Stratified sampling
          Empirical evaluation
      2 Simple random sampling with replacement
          Existing algorithms
          Algorithm ScaSRSWR
          Streaming environments
  • 9. Simple random sampling (SRS)
      Definition (Thompson, 2012): Simple random sampling is a sampling design in which s distinct items are selected from the n items in the population in such a way that every possible combination of s items is equally likely to be the sample selected.
      SRS is often used both as a sampling technique in its own right and as a building block for more complex sampling methods.
      Given an item set $T$ containing $n$ items $t_1, \ldots, t_n$ and an integer $s \le n$, we want to generate a simple random sample of size $s$ from $T$.
  • 11. The draw-by-draw method
      Draw-by-draw:
        Set $S = \emptyset$.
        for $i$ from 1 to $s$ do
          Select one item $t$ with equal probability from $T \setminus S$.
          Let $S := S \cup \{t\}$.
        end for
        Return $S$.
      Selecting one item with equal probability is hard when records have variable length and there are no indices. Representing $T \setminus S$ is also hard when the data is large.
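The draw-by-draw loop can be sketched in Python for a data set that fits in memory. This is only an illustration under that assumption; the function name `draw_by_draw` and the swap-remove trick are choices made here, not part of the talk. It makes the two difficulties visible: the code needs random access by index and an explicit copy of the remaining items.

```python
import random


def draw_by_draw(items, s, rng=None):
    """Draw-by-draw SRS without replacement (in-memory sketch)."""
    rng = rng or random.Random()
    remaining = list(items)              # explicit representation of T \ S
    if s > len(remaining):
        raise ValueError("sample size exceeds population size")
    sample = []
    for _ in range(s):
        j = rng.randrange(len(remaining))    # uniform pick from T \ S
        # Swap the pick to the end so that removal is O(1).
        remaining[j], remaining[-1] = remaining[-1], remaining[j]
        sample.append(remaining.pop())
    return sample
```

For example, `draw_by_draw(range(100), 10)` returns 10 distinct integers from 0 to 99.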
  • 13. The selection-rejection algorithm
      Selection-rejection (Fan, 1962):
        Set $i = 0$.
        for $j$ from 1 to $n$ do
          With probability $\frac{s - i}{n - j + 1}$, select $t_j$ and let $i := i + 1$.
        end for
      Pros: One-pass. $O(1)$ storage.
      Cons: Sequential. Needs both $n$ and $s$ to work.
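A minimal Python sketch of selection-rejection, written as a generator so that it keeps only a counter in memory. The name `selection_rejection` and the generator form are assumptions made for illustration. The caller must pass the true population size `n`, which is the drawback listed above.

```python
import random


def selection_rejection(items, n, s, rng=None):
    """One-pass SRS: select t_j with probability (s - i) / (n - j + 1)."""
    rng = rng or random.Random()
    selected = 0                          # i in the pseudocode
    for j, item in enumerate(items, start=1):
        # rng.random() lies in [0, 1), so this test succeeds with
        # probability (s - selected) / (n - j + 1).
        if rng.random() * (n - j + 1) < s - selected:
            selected += 1
            yield item
```

If `items` really contains `n` elements, exactly `s` of them are yielded, in their original order.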
  • 15. The reservoir algorithm
      Reservoir (Vitter, 1985):
        Store the first $s$ items in a reservoir $R$.
        for $i$ from $s + 1$ to $n$ do
          With probability $s/i$, pick an item from $R$ with equal probability and let $t_i$ take its place.
        end for
        Select the items in $R$.
      Pros: Does not require $n$.
      Cons: Sequential. $O(s)$ storage.
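The reservoir steps above translate almost line by line into Python. This is a sketch with an illustrative function name, `reservoir_sample`; it follows the simple version described on the slide, not any optimized variant.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Reservoir sampling: needs O(s) memory and no knowledge of n."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)               # the first s items fill R
        elif rng.random() * i < s:               # happens with probability s / i
            reservoir[rng.randrange(s)] = item   # evict a uniformly chosen slot
    return reservoir
```

The stream is consumed once and in order, which is why the method is listed as sequential.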
  • 17. The random sort algorithm
      Random sort (Sunter, 1977):
        Associate each $t_i$ with an independent key $u_i$ drawn from $U(0, 1)$.
        Sort $T$ in ascending order of the keys.
        Select the smallest $s$ items.
      Cons: Requires a random permutation of the entire data set.
      Pros: Generating the keys $\{u_i\}$ is embarrassingly parallel. Sorting is scalable.
  • 20. An example of random sort
      Set $n = 100$, $s = 10$, and hence the sampling probability $p = s/n = 0.1$.
      1 Generate random keys:
          $(0.644, t_1), (0.378, t_2), \ldots, (0.587, t_{10}), \ldots, (0.500, t_{99}), (0.471, t_{100})$
      2 Sort and select the smallest 10 items:
          $\underbrace{(0.028, t_{94}), (0.029, t_{44}), \ldots, (0.137, t_{69})}_{\text{the smallest 10 items}}, \ldots, (0.980, t_{26}), (0.988, t_{60})$
      Fact: the 10th item after the sort is associated with the random key 0.137.
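The same experiment can be reproduced with a short Python sketch. The function name `random_sort_sample` is illustrative, and the keys it draws will differ from the ones shown on the slide. `heapq.nsmallest` stands in for the full sort because only the $s$ smallest keys are needed.

```python
import heapq
import random


def random_sort_sample(items, s, rng=None):
    """Random sort: attach a U(0, 1) key to every item, keep the s smallest."""
    rng = rng or random.Random()
    # The index breaks ties so that the items themselves are never compared.
    keyed = [(rng.random(), idx, item) for idx, item in enumerate(items)]
    smallest = heapq.nsmallest(s, keyed)      # ascending by key
    return [(u, item) for u, _, item in smallest]


items = [f"t{i}" for i in range(1, 101)]      # n = 100
picked = random_sort_sample(items, 10)        # s = 10, so p = 0.1
tenth_key = picked[-1][0]                     # key of the 10th item after the sort
```

With $p = 0.1$, `tenth_key` typically lands near 0.1, which is the observation the thresholds in ScaSRS build on.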
  • 23. Heuristics
      Qualitatively speaking,
        if $u_i$ is “much larger” than $p$, then $t_i$ is “very unlikely” to be selected;
        if $u_i$ is “much smaller” than $p$, then $t_i$ is “very likely” to be selected.
      Set two thresholds $q_1$ and $q_2$, such that
        if $u_i > q_1$, reject $t_i$ directly;
        if $u_i < q_2$, select $t_i$ directly;
        otherwise, put $t_i$ onto a waiting list that goes to the sort phase.
      The resulting algorithm fails if we reject more than $n - s$ items, or if we select more than $s$ items. Otherwise, it returns the same result as the random sort algorithm.
  • 24. ScaSRS: a scalable simple random sampling algorithm
        Let $l = 0$, and let $W = \emptyset$ be the waiting list.
        for each item $t_i \in T$ do
          Draw a key $u_i$ independently from $U(0, 1)$.
          if $u_i < q_2$ then
            Select $t_i$ and let $l := l + 1$.
          else if $u_i < q_1$ then
            Associate $t_i$ with $u_i$ and add it to $W$.
          end if
        end for
        Sort the items in $W$ in ascending order of the key.
        Select the smallest $pn - l$ items from $W$.
      Given the same sequence of random keys, ScaSRS outputs the same result as the random sort algorithm whenever it succeeds. ScaSRS is embarrassingly parallel.
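A single-process Python sketch of the ScaSRS scan, assuming the thresholds `q1` and `q2` are supplied by the caller and that `s = pn` is known. The function name `scasrs` and the choice to raise an exception on failure are illustrative assumptions, not the author's implementation.

```python
import random


def scasrs(items, s, q1, q2, rng=None):
    """One scan with thresholds q2 < q1, then a sort of the waiting list only."""
    rng = rng or random.Random()
    selected, waiting = [], []
    for item in items:
        u = rng.random()
        if u < q2:
            selected.append(item)            # accepted on the fly
        elif u < q1:
            waiting.append((u, item))        # decided after the scan
        # u >= q1: rejected on the fly, nothing is stored
    need = s - len(selected)                 # pn - l in the pseudocode
    if need < 0 or need > len(waiting):
        raise RuntimeError("ScaSRS failed; rerun with fresh random keys")
    waiting.sort(key=lambda pair: pair[0])
    return selected + [item for _, item in waiting[:need]], len(waiting)
```

Only the waiting list is sorted, so the cost of the sort depends on the gap between `q1` and `q2` and not on the size of the whole data set.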
  • 25. A quantitative analysis
      Theorem. For a fixed $\delta > 0$, if we choose
        $q_1 = \min\left(1,\; p + \gamma_1 + \sqrt{\gamma_1^2 + 2\gamma_1 p}\right)$, where $\gamma_1 = -\frac{\log \delta}{n}$,
        $q_2 = \max\left(0,\; p + \gamma_2 - \sqrt{\gamma_2^2 + 3\gamma_2 p}\right)$, where $\gamma_2 = -\frac{2 \log \delta}{3n}$,
      ScaSRS succeeds with probability at least $1 - 2\delta$. Moreover, with high probability, it only needs $O(\sqrt{s})$ storage and runs in $O(n)$ time.
  • 26. A practical choice of δ
      Set $\delta = 0.00005$, so that $-\log \delta \approx 10$. We get the following thresholds:
        $q_1 = \min\left(1,\; p + \frac{10}{n} + \sqrt{\frac{100}{n^2} + \frac{20p}{n}}\right)$,
        $q_2 = \max\left(0,\; p + \frac{20}{3n} - \sqrt{\frac{400}{9n^2} + \frac{20p}{n}}\right)$,
      and ScaSRS succeeds with probability at least $1 - 2\delta = 99.99\%$.
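The thresholds are cheap to evaluate. The sketch below computes them from the theorem's formulas using the natural logarithm; the function name `scasrs_thresholds` and the example values $n = 10^6$, $p = 0.05$ are assumptions chosen for illustration.

```python
import math


def scasrs_thresholds(n, p, delta=0.00005):
    """Return (q1, q2) as defined in the theorem."""
    log_inv_delta = -math.log(delta)             # about 9.9 for delta = 0.00005
    gamma1 = log_inv_delta / n
    gamma2 = 2.0 * log_inv_delta / (3.0 * n)
    q1 = min(1.0, p + gamma1 + math.sqrt(gamma1 ** 2 + 2.0 * gamma1 * p))
    q2 = max(0.0, p + gamma2 - math.sqrt(gamma2 ** 2 + 3.0 * gamma2 * p))
    return q1, q2


n, p = 1_000_000, 0.05
q1, q2 = scasrs_thresholds(n, p)
print(f"q1 = {q1:.6f}, q2 = {q2:.6f}")
print(f"expected waiting-list size: about {n * (q1 - q2):.0f} of {n} items")
```

Both thresholds sit close to $p$, so only a small fraction of the items is expected to reach the sort phase.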
  • 27. Sketch of proof
      Denote by $U_i$ the random key associated with item $i$ and let $Y_i = 1_{U_i < q_1}$. Then $E[Y_i] = q_1$ and $E[Y_i^2] = q_1$.
      $Y = \sum_i Y_i$ is the number of un-rejected items during the scan. Applying a Bernstein-type inequality (Maurer, 2003),
        $\log \Pr\{Y \le pn\} \le -\frac{(q_1 - p)^2 n}{2 q_1}$.
      With the choice of $q_1$ in our theorem, we have $\Pr\{Y \le s\} \le \delta$.
      By similar arguments, we can bound the number of selected items during the scan:
        $\Pr\left\{\sum_i 1_{U_i < q_2} \ge s\right\} \le \delta$.
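The step from the Bernstein-type bound to $\Pr\{Y \le s\} \le \delta$ can be checked by direct substitution. The short derivation below is a worked verification, assuming $q_1 < 1$ so that the minimum in the definition of $q_1$ is not active; the auxiliary symbol $r$ is introduced here only for brevity. When $q_1 = 1$, no item is rejected and the bound holds trivially.

```latex
\begin{align*}
r &:= \sqrt{\gamma_1^2 + 2\gamma_1 p}, \qquad q_1 = p + \gamma_1 + r, \\
(q_1 - p)^2 &= (\gamma_1 + r)^2
             = \gamma_1^2 + 2\gamma_1 r + \left(\gamma_1^2 + 2\gamma_1 p\right)
             = 2\gamma_1 (\gamma_1 + r + p)
             = 2\gamma_1 q_1, \\
\frac{(q_1 - p)^2\, n}{2 q_1} &= \gamma_1 n = -\log \delta
\quad\Longrightarrow\quad
\log \Pr\{Y \le pn\} \le \log \delta .
\end{align*}
```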
  • 28. The size of the waiting list = $O(\sqrt{s})$, w.h.p.
      [Figure: pdf of the number of unrejected items and pdf of the number of accepted items, for n = 1e6, k = 50000, p = 0.05; annotation: O(sqrt(k)).]
  • 29. Streaming environments (when only p is given)
      If only $p$ is given, we can update the thresholds $q_1$ and $q_2$ on the fly based on the number of items seen so far, denoted by $i$:
        $q_{1,i} = \min\left(1,\; p + \gamma_{1,i} + \sqrt{\gamma_{1,i}^2 + 2\gamma_{1,i} p}\right)$, where $\gamma_{1,i} = -\frac{\log \delta}{i}$,
        $q_{2,i} = \max\left(0,\; p + \gamma_{2,i} - \sqrt{\gamma_{2,i}^2 + 3\gamma_{2,i} p}\right)$, where $\gamma_{2,i} = -\frac{2 \log \delta}{3i}$.
      O(log n + √ s log n) storage
      at least $1 - 2\delta$ success rate
      not necessary to know the exact $s$ but just a good lower bound
      use the local count on each process or a global count updated less frequently
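A Python sketch of the streaming variant with only $p$ known. The per-item thresholds follow the formulas above; the final consistency check against the thresholds at the last item is an assumption added here so that the sketch never silently returns a sample that differs from random sort, and the names `streaming_thresholds` and `sample_with_rate` are illustrative.

```python
import math
import random


def streaming_thresholds(i, p, delta=0.00005):
    """Thresholds (q1_i, q2_i) after i items have been seen."""
    g1 = -math.log(delta) / i
    g2 = -2.0 * math.log(delta) / (3.0 * i)
    q1 = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q2 = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    return q1, q2


def sample_with_rate(stream, p, delta=0.00005, rng=None):
    rng = rng or random.Random()
    selected, waiting, n = [], [], 0
    for n, item in enumerate(stream, start=1):
        q1, q2 = streaming_thresholds(n, p, delta)
        u = rng.random()
        if u < q2:
            selected.append(item)
        elif u < q1:
            waiting.append((u, item))
    if n == 0:
        return []
    s = round(p * n)
    q1_n, q2_n = streaming_thresholds(n, p, delta)
    # Assumed sanity check: at most s keys lie below q2_n and at least s
    # keys lie below q1_n, so the s smallest keys are all still available.
    below_q2 = len(selected) + sum(u < q2_n for u, _ in waiting)
    below_q1 = len(selected) + sum(u < q1_n for u, _ in waiting)
    if not below_q2 <= s <= below_q1:
        raise RuntimeError("sampling failed; rerun with fresh random keys")
    waiting.sort(key=lambda pair: pair[0])
    return selected + [item for _, item in waiting[:s - len(selected)]]
```

Early in the stream $q_{1,i}$ is close to 1 and $q_{2,i}$ is 0, so almost every item is wait-listed; the two thresholds tighten around $p$ as $i$ grows.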
  • 30. Streaming environments (when only s is given)
      When only $s$ is given, we can no longer accept items on the fly because the sampling probability could be arbitrarily small. However, we can still reject items on the fly based on $s$ and $i$:
        $q_{1,i} = \min\left(1,\; \frac{s}{i} + \gamma_{1,i} + \sqrt{\gamma_{1,i}^2 + 2\gamma_{1,i} \frac{s}{i}}\right)$, where $\gamma_{1,i} = -\frac{\log \delta}{i}$.
      s(log n + 1) + O( √ s + log n) storage
      at least $1 - \delta$ success rate
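A sketch of the reject-only variant in Python. Items that survive the on-the-fly rejection are kept with their keys, and the $s$ smallest keys are returned at the end. The final check against the last threshold is an assumption added here for safety, and the function names are illustrative.

```python
import heapq
import math
import random


def reject_threshold(i, s, delta=0.00005):
    """q1_i when only the sample size s is known."""
    g = -math.log(delta) / i
    r = s / i
    return min(1.0, r + g + math.sqrt(g * g + 2.0 * g * r))


def sample_fixed_size(stream, s, delta=0.00005, rng=None):
    rng = rng or random.Random()
    waiting, n = [], 0
    for n, item in enumerate(stream, start=1):
        u = rng.random()
        if u < reject_threshold(n, s, delta):
            waiting.append((u, item))        # not rejected: keep for later
    if n < s:
        raise ValueError("stream is shorter than the requested sample size")
    smallest = heapq.nsmallest(s, waiting, key=lambda pair: pair[0])
    # Assumed sanity check: every rejected key is at least the final
    # threshold, so the result is exact if the s-th key lies below it.
    if smallest[-1][0] >= reject_threshold(n, s, delta):
        raise RuntimeError("sampling failed; rerun with fresh random keys")
    return [item for _, item in smallest]
```

For the first $s$ items the threshold equals 1, so they are always kept, which mirrors the initial fill of a reservoir.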
  • 31. Stratified sampling
      If the item set is heterogeneous, it may be possible to partition it into several non-overlapping homogeneous subsets, called strata. Applying SRS within each stratum is preferred to applying SRS to the entire set because it yields a more representative sample. This approach is called stratified sampling.
      Applications:
        U.S. Census survey
        political survey
      Stratification:
        based on training labels
        based on days of the week
  • 32. Stratified sampling (cont.). Applying ScaSRS to stratified sampling is straightforward. Let m be the number of strata. We have the following results: if the size of each stratum is given, we need O(√(ms)) storage; if only the sampling probability p is given, we need O(m log n + √ms log n) storage; if only the sample size s is given, we need s(log n + 1) + O(√sm log n) storage.
  • 34. Empirical evaluation: MapReduce implementation
      Set l = 0.
      function map($t_i$)
        Generate $u_i$ from U(0, 1).
        if $u_i < q_2$ then
          Select (and output directly) $t_i$.
          l := l + 1.
        else if $u_i < q_1$ then
          Emit ($u_i$, $t_i$).
        end if
      end function
      function reduce(..., ($u_i$, $t_i$), ...)
        Select the first pn − l items, ordered by key $u_i$.
      end function
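The map and reduce steps above can be emulated in a single process. The sketch below is a rough illustration, not the MapReduce code used in the evaluation: the function name `scasrs` is hypothetical, n is assumed known, s is taken as round(pn), and the thresholds use the formulas from the streaming slide with i replaced by n.

```python
import math
import random


def scasrs(items, p, delta=1e-4, seed=None):
    """Single-process emulation of the ScaSRS map and reduce phases."""
    rng = random.Random(seed)
    n = len(items)
    s = round(p * n)
    gamma1 = -math.log(delta) / n
    gamma2 = -2.0 * math.log(delta) / (3.0 * n)
    q1 = min(1.0, p + gamma1 + math.sqrt(gamma1 ** 2 + 2.0 * gamma1 * p))
    q2 = max(0.0, p + gamma2 - math.sqrt(gamma2 ** 2 + 3.0 * gamma2 * p))

    accepted, waitlist = [], []
    for item in items:  # "map": decide each item independently
        u = rng.random()
        if u < q2:
            accepted.append(item)        # select directly
        elif u < q1:
            waitlist.append((u, item))   # emit (key, item)
        # otherwise the item is rejected

    need = s - len(accepted)             # "reduce": fill from the waiting list
    if need < 0 or need > len(waitlist):
        raise RuntimeError("thresholds failed; rerun with new random keys")
    waitlist.sort(key=lambda entry: entry[0])
    return accepted + [item for _, item in waitlist[:need]]
```

A failed run is expected with probability at most about 2δ, in which case the scan is simply repeated with fresh keys.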
  • 35. Empirical evaluation: simple random sampling
                             P1      P2      P3      P4      P5      P6
      n                      6.0e7   6.0e7   3.0e8   3.0e8   1.5e9   1.5e9
      p                      0.01    0.1     0.01    0.1     0.01    0.1
      s                      6.0e5   6.0e6   3.0e6   3.0e7   1.5e7   1.5e8
      Selection-Rejection    281     355     1371    1475    >3600   >3600
      Reservoir              288     299     1285    1571    >3600   >3600
      Random Sort            513     581     1629    2344    >3600   >3600
      ScaSRS                 96      103     126     127     140     158
      ScaSRSp                98      114     109     139     162     214
      W                      6.9e3   2.2e4   1.6e4   4.9e4   3.4e4   1.1e5
      Wp                     5.8e4   1.8e5   2.9e5   9.1e5   1.5e6   4.5e6
      Table: Test problems, running times (in seconds), and waiting list sizes.
  • 36. Empirical evaluation: stratified sampling. Problem and setup: 23.25 billion page-view events, 7 terabytes, 8 strata. The ratio between the size of the largest stratum and that of the smallest stratum is approximately 15000. p = 0.01. 3000 mappers and 5 reducers. Result: 509 seconds. Within the waiting list, the ratio between the size of the largest stratum and that of the smallest stratum is 861.2.
  • 37. Summary for ScaSRS: based on the random sort algorithm; uses probabilistic thresholds to decide on the fly whether to select, reject, or wait-list an item independently of others; embarrassingly parallel; high success rate and O(√s) storage; streaming environments; extension to stratified sampling; straightforward MapReduce implementation.
  • 38. Outline. 1. Simple random sampling without replacement: existing algorithms; algorithm ScaSRS; streaming environments; stratified sampling; empirical evaluation. 2. Simple random sampling with replacement: existing algorithms; algorithm ScaSRSWR; streaming environments.
  • 40. Simple random sampling with replacement (SRSWR). A simple random sample with replacement (SRSWR) of size s from a population of n items can be thought of as drawing s independent samples of size 1, where each of the s items in the sample is selected from the population with equal probability. An item may appear more than once in the sample. This is equivalent to sampling from $\mathrm{Multinomial}\left(s;\, \frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right)$. Applications: bootstrapping; ensemble methods; generating random tuples.
  • 41. The draw-by-draw method
      Draw-by-draw:
      Set S = ∅.
      for i from 1 to s do
        Select one item t with equal probability from T.
        Let S := S + {t}.
      end for
      Return S.
      Selecting one item with equal probability is hard due to variable-length records and the lack of indices.
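For contrast, draw-by-draw is trivial when the population fits in memory and supports random access, which is exactly what large variable-length record files lack. A minimal sketch under that assumption (the function name `draw_by_draw` is hypothetical):

```python
import random


def draw_by_draw(items, s, seed=None):
    """SRSWR by s independent uniform draws from an indexable population."""
    rng = random.Random(seed)
    sample = []
    for _ in range(s):
        # Requires O(1) random access, which a record stream does not offer.
        sample.append(items[rng.randrange(len(items))])
    return sample
```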
  • 43. The Poisson-approximation algorithm
      Poisson-approximation (Laserson, 2013):
      for each item $t_i \in T$ do
        Generate a number $s_i$ from distribution Pois(p).
        if $s_i > 0$ then
          Repeat $t_i$ for $s_i$ times in the sample.
        end if
      end for
      Pros: one-pass; O(1) storage; embarrassingly parallel. Cons: variable sample size.
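The loop above can be sketched with the standard library alone. The Python standard library has no Poisson generator, so this sketch assumes Knuth's multiplication method, which is adequate for the small rates used here; the names `poisson` and `poisson_sample` are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Pois(lam) with Knuth's multiplication method (small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def poisson_sample(items, p, seed=None):
    """One pass over items; each item is repeated Pois(p) times."""
    rng = random.Random(seed)
    sample = []
    for item in items:
        sample.extend([item] * poisson(p, rng))
    return sample
```

The output length is itself a Pois(pn) draw, so it only lands near pn rather than on it, which is the drawback the slide lists.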
  • 45. The Poisson-approximation algorithm (cont.). If $Y_i \sim \mathrm{Pois}(p)$, $i = 1, \ldots, n$, are independent, then given $\sum_{i=1}^{n} Y_i = s$, $(Y_1, Y_2, \ldots, Y_n)$ follows $\mathrm{Multinom}\left(s;\, \frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right)$. If the sample from the Poisson-approximation algorithm happens to have size s = pn, it is a simple random sample with replacement. If $X_i \sim \mathrm{Pois}(\lambda_i)$, $i = 1, \ldots, n$, are independent and $\lambda = \sum_{i=1}^{n} \lambda_i$, we have $Y = \sum_{i=1}^{n} X_i \sim \mathrm{Pois}(\lambda)$. The size of the sample from the Poisson-approximation algorithm therefore follows the distribution Pois(pn).
  • 48. How to obtain the exact sample size? To get the exact sample size, we follow an approach similar to what we have in ScaSRS. Generate a Poisson sequence to pre-accept items on the fly, where each value follows $\mathrm{Pois}(p_1)$ independently for some $p_1 < p$. Generate another Poisson sequence to wait-list items, where each value follows $\mathrm{Pois}(p_2)$ independently for some $p_2 > 0$. Let a be the number of items we pre-accepted. Select a simple random sample without replacement of size s − a from the waiting list and merge it into the final sample.
  • 49. ScaSRSWR: a scalable algorithm for SRSWR
      1: Choose $\delta \in (0, 1)$, $p_1 < p$ such that $F_{\mathrm{Pois}(p_1 n)}(s) \ge 1 - \delta$, and $p_2 > 0$ such that $F_{\mathrm{Pois}((p_1 + p_2) n)}(s) \le \delta$.
      2: function map($t_i$)
      3:   Generate a number $s_{1i}$ from distribution $\mathrm{Pois}(p_1)$.
      4:   Include $t_i$ for $s_{1i}$ times in the sample.
      5:   Generate a number $s_{2i}$ from distribution $\mathrm{Pois}(p_2)$.
      6:   for $j \in \{1, \ldots, s_{2i}\}$ do
      7:     Draw a value u from U(0, 1) and emit (u, $t_i$).
      8:   end for
      9: end function
      10: function reduce(..., ($u_k$, $t_k$), ...)
      11:   Let a be the number of accepted items in step 4.
      12:   Select the first s − a items ordered by key.
      13: end function
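A single-process sketch of the algorithm above follows. Two choices are assumptions rather than part of this slide: the rates are set to p1 = p − 5√(p/n) and p2 = 10√(p/n), borrowed from the later streaming slide instead of solving the Poisson CDF conditions for a given δ, and Poisson draws use Knuth's multiplication method. The names `poisson` and `scasrswr` are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Pois(lam) with Knuth's multiplication method (small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def scasrswr(items, s, seed=None):
    """Sample s items with replacement; assumes s = p * n is above 100."""
    rng = random.Random(seed)
    n = len(items)
    p = s / n
    width = 5.0 * math.sqrt(p / n)
    p1 = max(0.0, p - width)   # pre-accept rate (assumed choice)
    p2 = 2.0 * width           # wait-list rate (assumed choice)

    accepted, waitlist = [], []
    for item in items:  # "map"
        accepted.extend([item] * poisson(p1, rng))
        for _ in range(poisson(p2, rng)):
            waitlist.append((rng.random(), item))

    need = s - len(accepted)   # "reduce"
    if need < 0 or need > len(waitlist):
        raise RuntimeError("pre-accepted too many or wait-listed too few; rerun")
    waitlist.sort(key=lambda entry: entry[0])
    return accepted + [item for _, item in waitlist[:need]]
```

Sorting the wait-listed copies by independent uniform keys and taking the first s − a is one way to draw the simple random sample without replacement from the waiting list.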
  • 50. ScaSRSWR: a scalable algorithm for SRSWR (cont.). Theorem: For a fixed δ > 0, ScaSRSWR outputs a simple random sample with replacement of size s with probability at least 1 − 2δ. Moreover, with high probability, it only needs O(√s) storage and runs in O(n) time. Proof sketch: The algorithm fails to output a sample of size s if it pre-accepted too many items, i.e., a > s, or it wait-listed too few items, i.e., a + w < s, where w is the size of the waiting list. So the overall failure rate is at most 2δ. Given a ≤ s and a + w ≥ s, we can prove that the output is a simple random sample with replacement.
  • 53. Streaming environments. It is possible to extend ScaSRSWR to a streaming environment with some tweaks. Assuming that only p is given, we need three Poisson sequences: (1) The first sequence generates pre-accepted items, where $X_{1i} \sim \mathrm{Pois}(p_{1i})$, which means including $t_i$ for $X_{1i}$ times. (2) The second sequence gets "merged" with the first sequence at the end of the stream such that each element in the merged sequence follows $\mathrm{Pois}(p_1)$, which is the same as the non-streaming case. (3) The third sequence is collected at the end of the stream and transformed such that each element in the transformed sequence follows $\mathrm{Pois}(p_2)$, which is the same as the non-streaming case.
  • 55. How to merge two Poisson sequences? If $X_i \sim \mathrm{Pois}(\lambda_i)$ are independent and $\lambda = \sum_i \lambda_i$, then $Y = \sum_i X_i \sim \mathrm{Pois}(\lambda)$. Suppose the ith item in the first sequence follows $\mathrm{Pois}(p_{1i})$. We need the ith item in the second sequence to follow $\mathrm{Pois}(p_1 - p_{1i})$ in order for the sum to follow $\mathrm{Pois}(p_1)$. However, $p_1$ depends on n, which is unknown until we reach the end of the stream. If $X \sim \mathrm{Pois}(\lambda)$ and $Y \sim \mathrm{Binom}(X, p)$, then $Y \sim \mathrm{Pois}(\lambda p)$. We can therefore transform the second sequence at the end of the stream so that each item in the merged sequence follows $\mathrm{Pois}(p_1)$, as long as $p_{1i} \le p_1$ and $p_{1i} + p^c_{1i} \ge p_1$, $i = 1, \ldots, n$: if $X_{1i} \sim \mathrm{Pois}(p_{1i})$, $X^c_{1i} \sim \mathrm{Pois}(p - p_{1i})$, and $Y^c_{1i} \sim \mathrm{Binom}\left(X^c_{1i}, \frac{p_1 - p_{1i}}{p - p_{1i}}\right)$, then $X_{1i} + Y^c_{1i} \sim \mathrm{Pois}(p_1)$.
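The thinning step can be written as a small helper. This is a sketch under assumptions: the names `binomial` and `merge_count` are hypothetical, and the binomial draw is a plain sum of Bernoulli trials, which is fine because the counts involved are small.

```python
import random


def binomial(trials, prob, rng):
    """Draw from Binom(trials, prob) as a sum of Bernoulli trials."""
    return sum(1 for _ in range(trials) if rng.random() < prob)


def merge_count(x1, xc, p, p1i, p1, rng):
    """Combine x1 ~ Pois(p1i) with thinned xc ~ Pois(p - p1i).

    Each of the xc copies is kept with probability (p1 - p1i) / (p - p1i),
    so the kept count follows Pois(p1 - p1i) and the total follows Pois(p1).
    """
    if not (p1i <= p1 <= p):
        raise ValueError("need p1i <= p1 <= p")
    keep_prob = (p1 - p1i) / (p - p1i) if p > p1i else 0.0
    return x1 + binomial(xc, keep_prob, rng)
```

For example, with p1 equal to p every complementary copy is kept, and with p1 equal to p1i none is.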
  • 56. Streaming environments. For any $\lambda > 100$, it is easy to verify the following bounds: $\Pr\{X \le \lambda\} > 0.99995$ for $X \sim \mathrm{Pois}(\lambda - 5\sqrt{\lambda})$, and $\Pr\{X < \lambda\} < 1 - 0.99995$ for $X \sim \mathrm{Pois}(\lambda + 5\sqrt{\lambda})$. Assuming that $n > n_0 > 100/p$, we choose $p_1 = p - 5\sqrt{p/n}$ and $p_2 = 10\sqrt{p/n}$. For each $i \in \{1, \ldots, n\}$, we set $p_{1i} = p - 5\sqrt{p/\max(i, n_0)}$, $p^c_{1i} = p - p_{1i} = 5\sqrt{p/\max(i, n_0)}$, and $p_{2i} = 10\sqrt{p/\max(i, n_0)}$. The algorithm succeeds with probability at least 0.9999. The expected storage is $\sum_{i=1}^{n} (p^c_{1i} + p_{2i}) \le \sum_{i=1}^{n} 15\sqrt{p/i} = O(\sqrt{pn}) = O(\sqrt{s})$.
  • 59. Streaming ScaSRSWR (when only p is given)
      function map($t_i$)
        Let $p_{1i} = p - 5\sqrt{p/\max(i, n_0)}$ and generate $s_{1i} \sim \mathrm{Pois}(p_{1i})$.
        Include $t_i$ for $s_{1i}$ times in the sample.
        Let $p^c_{1i} = p - p_{1i}$ and generate $s^c_{1i} \sim \mathrm{Pois}(p^c_{1i})$.
        Emit (0, ($p_{1i}$, $s^c_{1i}$, $t_i$)) if $s^c_{1i} > 0$.
        Let $p_{2i} = 10\sqrt{p/\max(i, n_0)}$ and generate $s_{2i} \sim \mathrm{Pois}(p_{2i})$.
        Emit (1, ($p_{2i}$, $s_{2i}$, $t_i$)) if $s_{2i} > 0$.
      end function
      Compute $p_1 = p - 5\sqrt{p/n}$ and $p_2 = 10\sqrt{p/n}$.
      function reduce(0, [..., ($p_{1k}$, $s^c_{1k}$, $t_k$), ...])
        Generate $\bar{s}^c_{1k} \sim \mathrm{Binom}\left(s^c_{1k}, \frac{p_1 - p_{1k}}{p - p_{1k}}\right)$ and include $t_k$ for $\bar{s}^c_{1k}$ times.
      end function
      Let a be the number of accepted items and W = ∅.
      function reduce(1, [..., ($p_{2k}$, $s_{2k}$, $t_k$), ...])
        Generate $\bar{s}_{2k} \sim \mathrm{Binom}\left(s_{2k}, \frac{p_2}{p_{2k}}\right)$ and add $t_k$ to W for $\bar{s}_{2k}$ times.
      end function
      Select a simple random sample of size s − a from W and output it.
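The streaming procedure can be emulated in one process as follows. This is a sketch under assumptions, not the talk's implementation: the names are hypothetical, s is taken as round(pn) once n is known, Poisson draws use Knuth's multiplication method, and the two reducers are ordinary loops.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Pois(lam) with Knuth's multiplication method (small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def binomial(trials, prob, rng):
    """Draw from Binom(trials, prob) as a sum of Bernoulli trials."""
    return sum(1 for _ in range(trials) if rng.random() < prob)


def streaming_scasrswr(stream, p, n0, seed=None):
    """SRSWR of size round(p * n) from a stream whose length n is unknown."""
    if n0 * p <= 100:
        raise ValueError("need n0 > 100 / p")
    rng = random.Random(seed)
    accepted, complement, waiting = [], [], []
    n = 0
    for i, item in enumerate(stream, start=1):  # "map"
        n = i
        width = 5.0 * math.sqrt(p / max(i, n0))
        p1i, p2i = p - width, 2.0 * width
        accepted.extend([item] * poisson(p1i, rng))
        c = poisson(p - p1i, rng)
        if c > 0:
            complement.append((p1i, c, item))   # key 0
        w = poisson(p2i, rng)
        if w > 0:
            waiting.append((p2i, w, item))      # key 1
    if n < n0:
        raise ValueError("stream shorter than n0")

    p1 = p - 5.0 * math.sqrt(p / n)
    p2 = 10.0 * math.sqrt(p / n)
    for p1i, c, item in complement:             # reduce(0, ...)
        accepted.extend([item] * binomial(c, (p1 - p1i) / (p - p1i), rng))
    pool = []
    for p2i, w, item in waiting:                # reduce(1, ...)
        pool.extend([item] * binomial(w, p2 / p2i, rng))

    need = round(p * n) - len(accepted)
    if need < 0 or need > len(pool):
        raise RuntimeError("sampling failed; rerun the scan")
    return accepted + rng.sample(pool, need)
```

The check `n >= n0` keeps both thinning probabilities inside [0, 1], matching the slide's assumption that n exceeds n0.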
  • 60. Streaming ScaSRSWR (when only s is given). When only s is given, we can no longer accept items on the fly because the sampling probability could be arbitrarily small. However, we can still generate a Poisson sequence as the waiting list W: choose $X_i \sim \mathrm{Pois}\left((s + 5\sqrt{s})/\max(i, n_0)\right)$. At the end of the stream, we can adjust the sequence using Binomial values to make the Poisson numbers i.i.d. The sum of the adjusted sequence is greater than s with high probability. Storage: $O\left(\sum_i (s + 5\sqrt{s})/\max(i, n_0)\right) = O(s \log n)$. After adjustment, generate a simple random sample without replacement of size s from W and output it as the final sample.
  • 61. Summary. Scalable simple random sampling algorithms (ScaSRS and ScaSRSWR): independently select, reject, or wait-list an item on the fly; embarrassingly parallel; high success rate and O(√s) storage; streaming environments; extension to stratified sampling. Open-source implementations: Apache Spark: https://github.com/mengxr/spark-sampling/ and Apache DataFu: http://datafu.incubator.apache.org/