Introduction to Counting
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
January 30, 2013
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 1 / 28
Why counting?
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 2 / 28
Why counting?
sta·tis·tic
noun
1. A function of a random sample of data
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 3 / 28
Why counting?
p( y
support
| x
age
)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 4 / 28
Why counting?
?p( y
support
| x1, x2, x3, . . .
age, sex, race, party
)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 4 / 28
Why counting?
Problem:
Traditionally difficult to obtain reliable estimates due to small
sample sizes or sparsity
(e.g., ∼ 100 age × 2 sex × 5 race × 3 party = 3000 groups,
but typical surveys collect ∼ 1,000s of responses)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
Why counting?
Potential solution:
Sacrifice granularity for precision, by binning observations into
larger, but fewer, groups
(e.g., bin age into a few groups: 18-29, 30-49, 50-64, 65+)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
Why counting?
Potential solution:
Develop more sophisticated methods that generalize well from
small samples
(e.g., fit a model: support ∼ β0 + β1age + β2age2 + . . .)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
Why counting?
(Partial) solution:
Obtain larger samples through other means, so we can just count
and divide to make estimates via relative frequencies
(e.g., with ∼ 1M responses, we have 100s per group and can
estimate support within a few percentage points)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 6 / 28
Why counting?
The good:
Shift away from sophisticated statistical methods on small samples
to simple methods on large samples
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
Why counting?
The bad:
Even simple methods (e.g., counting) are computationally
challenging at large scales
(1M is easy, 1B a bit less so, 1T gets interesting)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
Why counting?
Claim:
Solving the counting problem at scale enables you to investigate
many interesting questions in the social sciences
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
Learning to count
This week:
Counting at small/medium scales on a single machine
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 8 / 28
Learning to count
This week:
Counting at small/medium scales on a single machine
Following weeks:
Counting at large scales in parallel
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 8 / 28
Counting, the easy way
Split / Apply / Combine1
http://bit.ly/sactutorial
• Load dataset into memory
• Split: Arrange observations into groups of interest
• Apply: Compute distributions and statistics within each group
• Combine: Collect results across groups
1
http://bit.ly/splitapplycombine
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 9 / 28
The generic group-by operation
Split / Apply / Combine
for each observation as (group, value):
place value in bucket for corresponding group
for each group:
apply a function over values in bucket
output group and result
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 10 / 28
The generic group-by operation
Split / Apply / Combine
for each observation as (group, value):
place value in bucket for corresponding group
for each group:
apply a function over values in bucket
output group and result
Useful for computing arbitrary within-group statistics when we
have required memory
(e.g., conditional distribution, median, etc.)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 10 / 28
Why counting?
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 11 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 12 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 12 / 28
Example: Movielens
0
1,000,000
2,000,000
3,000,000
1 2 3 4 5
Rating
Numberofratings
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 13 / 28
Example: Movielens
0
1,000,000
2,000,000
3,000,000
1 2 3 4 5
Rating
Numberofratings
group by rating value
for each group:
count # ratings
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 14 / 28
Example: Movielens
1 2 3 4 5
Mean Rating by Movie
Density
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 15 / 28
Example: Movielens
group by movie id
for each group:
compute average rating
1 2 3 4 5
Mean Rating by Movie
Density
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 16 / 28
Example: Movielens
0%
25%
50%
75%
100%
0 3,000 6,000 9,000
Movie Rank
CDF
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 17 / 28
Example: Movielens
0%
25%
50%
75%
100%
0 3,000 6,000 9,000
Movie Rank
CDF
group by movie id
for each group:
count # ratings
sort by group size
cumulatively sum group sizes
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 18 / 28
Example: Movielens
0
2,000
4,000
6,000
8,000
100 10,000
User eccentricity
Numberofusers
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 19 / 28
Example: Movielens
join movie ranks to ratings
group by user id
for each group:
compute median movie rank
0
2,000
4,000
6,000
8,000
100 10,000
User eccentricity
Numberofusers
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 20 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
What do we do when the full dataset exceeds available memory?
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
What do we do when the full dataset exceeds available memory?
Sampling?
Unreliable estimates for rare groups
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
What do we do when the full dataset exceeds available memory?
Random access from disk?
1000x more storage, but 1000x slower2
2
Numbers every programmer should know
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
What do we do when the full dataset exceeds available memory?
Streaming
Read data one observation at a time, storing only needed state
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
The combinable group-by operation
Streaming
for each observation as (group, value):
if new group:
initialize result
update result for corresponding group as function of
existing result and current value
for each group:
output group and result
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 22 / 28
The combinable group-by operation
Streaming
for each observation as (group, value):
if new group:
initialize result
update result for corresponding group as function of
existing result and current value
for each group:
output group and result
Useful for computing a subset of within-group statistics with a
limited memory footprint
(e.g., min, mean, max, variance, etc.)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 22 / 28
Example: Movielens
0
1,000,000
2,000,000
3,000,000
1 2 3 4 5
Rating
Numberofratings
for each rating:
counts[movie id]++
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 23 / 28
Example: Movielens
for each rating:
totals[movie id] += rating
counts[movie id]++
for each group:
totals[movie id] /
counts[movie id]
1 2 3 4 5
Mean Rating by Movie
Density
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 24 / 28
Yet another group-by operation
Per-group histograms
for each observation as (group, value):
histogram[group][value]++
for each group:
compute result as a function of histogram
output group and result
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 25 / 28
Yet another group-by operation
Per-group histograms
for each observation as (group, value):
histogram[group][value]++
for each group:
compute result as a function of histogram
output group and result
We can recover arbitrary statistics if we can afford to store counts
of all distinct values within in each group
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 25 / 28
The group-by operation
For arbitrary input data:
Memory Scenario Distributions Statistics
N Small dataset Yes General
V*G Small distributions Yes General
G Small # groups No Combinable
V Small # outcomes No No
1 Large # both No No
N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 26 / 28
Examples (w/ 8GB RAM)
Median rating by movie for Netflix
N ∼ 100M ratings
G ∼ 20K movies
V ∼ 10 half-star values
V *G ∼ 200K, store per-group histograms for arbitrary statistics
(scales to arbitrary N, if you’re patient)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
Examples (w/ 8GB RAM)
Median rating by video for YouTube
N ∼ 10B ratings
G ∼ 1B videos
V ∼ 10 half-star values
V *G ∼ 10B, fails because per-group histograms are too large to
store in memory
G ∼ 1B, but no (exact) calculation for streaming median
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
Examples (w/ 8GB RAM)
Mean rating by video for YouTube
N ∼ 10B ratings
G ∼ 1B videos
V ∼ 10 half-star values
G ∼ 1B, use streaming to compute combinable statistics
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
The group-by operation
For pre-grouped input data:
Memory Scenario Distributions Statistics
N Small dataset Yes General
V*G Small distributions Yes General
G Small # groups No Combinable
V Small # outcomes Yes General
1 Large # both No Combinable
N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 28 / 28

Modeling Social Data, Lecture 2: Introduction to Counting

  • 1.
    Introduction to Counting APAME4990 Modeling Social Data Jake Hofman Columbia University January 30, 2013 Jake Hofman (Columbia University) Intro to Counting January 30, 2013 1 / 28
  • 2.
    Why counting? Jake Hofman(Columbia University) Intro to Counting January 30, 2013 2 / 28
  • 3.
    Why counting? sta·tis·tic noun 1. Afunction of a random sample of data Jake Hofman (Columbia University) Intro to Counting January 30, 2013 3 / 28
  • 4.
    Why counting? p( y support |x age ) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 4 / 28
  • 5.
    Why counting? ?p( y support |x1, x2, x3, . . . age, sex, race, party ) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 4 / 28
  • 6.
    Why counting? Problem: Traditionally difficultto obtain reliable estimates due to small sample sizes or sparsity (e.g., ∼ 100 age × 2 sex × 5 race × 3 party = 3000 groups, but typical surveys collect ∼ 1,000s of responses) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
  • 7.
    Why counting? Potential solution: Sacrificegranularity for precision, by binning observations into larger, but fewer, groups (e.g., bin age into a few groups: 18-29, 30-49, 50-64, 65+) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
  • 8.
    Why counting? Potential solution: Developmore sophisticated methods that generalize well from small samples (e.g., fit a model: support ∼ β0 + β1age + β2age2 + . . .) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
  • 9.
    Why counting? (Partial) solution: Obtainlarger samples through other means, so we can just count and divide to make estimates via relative frequencies (e.g., with ∼ 1M responses, we have 100s per group and can estimate support within a few percentage points) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 6 / 28
  • 10.
    Why counting? The good: Shiftaway from sophisticated statistical methods on small samples to simple methods on large samples Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
  • 11.
    Why counting? The bad: Evensimple methods (e.g., counting) are computationally challenging at large scales (1M is easy, 1B a bit less so, 1T gets interesting) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
  • 12.
    Why counting? Claim: Solving thecounting problem at scale enables you to investigate many interesting questions in the social sciences Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
  • 13.
    Learning to count Thisweek: Counting at small/medium scales on a single machine Jake Hofman (Columbia University) Intro to Counting January 30, 2013 8 / 28
  • 14.
    Learning to count Thisweek: Counting at small/medium scales on a single machine Following weeks: Counting at large scales in parallel Jake Hofman (Columbia University) Intro to Counting January 30, 2013 8 / 28
  • 15.
    Counting, the easyway Split / Apply / Combine1 http://bit.ly/sactutorial • Load dataset into memory • Split: Arrange observations into groups of interest • Apply: Compute distributions and statistics within each group • Combine: Collect results across groups 1 http://bit.ly/splitapplycombine Jake Hofman (Columbia University) Intro to Counting January 30, 2013 9 / 28
  • 16.
    The generic group-byoperation Split / Apply / Combine for each observation as (group, value): place value in bucket for corresponding group for each group: apply a function over values in bucket output group and result Jake Hofman (Columbia University) Intro to Counting January 30, 2013 10 / 28
  • 17.
    The generic group-byoperation Split / Apply / Combine for each observation as (group, value): place value in bucket for corresponding group for each group: apply a function over values in bucket output group and result Useful for computing arbitrary within-group statistics when we have required memory (e.g., conditional distribution, median, etc.) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 10 / 28
  • 18.
    Why counting? Jake Hofman(Columbia University) Intro to Counting January 30, 2013 11 / 28
  • 19.
    Example: Anatomy ofthe long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M Jake Hofman (Columbia University) Intro to Counting January 30, 2013 12 / 28
  • 20.
    Example: Anatomy ofthe long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M Jake Hofman (Columbia University) Intro to Counting January 30, 2013 12 / 28
  • 21.
    Example: Movielens 0 1,000,000 2,000,000 3,000,000 1 23 4 5 Rating Numberofratings Jake Hofman (Columbia University) Intro to Counting January 30, 2013 13 / 28
  • 22.
    Example: Movielens 0 1,000,000 2,000,000 3,000,000 1 23 4 5 Rating Numberofratings group by rating value for each group: count # ratings Jake Hofman (Columbia University) Intro to Counting January 30, 2013 14 / 28
  • 23.
    Example: Movielens 1 23 4 5 Mean Rating by Movie Density Jake Hofman (Columbia University) Intro to Counting January 30, 2013 15 / 28
  • 24.
    Example: Movielens group bymovie id for each group: compute average rating 1 2 3 4 5 Mean Rating by Movie Density Jake Hofman (Columbia University) Intro to Counting January 30, 2013 16 / 28
  • 25.
    Example: Movielens 0% 25% 50% 75% 100% 0 3,0006,000 9,000 Movie Rank CDF Jake Hofman (Columbia University) Intro to Counting January 30, 2013 17 / 28
  • 26.
    Example: Movielens 0% 25% 50% 75% 100% 0 3,0006,000 9,000 Movie Rank CDF group by movie id for each group: count # ratings sort by group size cumulatively sum group sizes Jake Hofman (Columbia University) Intro to Counting January 30, 2013 18 / 28
  • 27.
    Example: Movielens 0 2,000 4,000 6,000 8,000 100 10,000 Usereccentricity Numberofusers Jake Hofman (Columbia University) Intro to Counting January 30, 2013 19 / 28
  • 28.
    Example: Movielens join movieranks to ratings group by user id for each group: compute median movie rank 0 2,000 4,000 6,000 8,000 100 10,000 User eccentricity Numberofusers Jake Hofman (Columbia University) Intro to Counting January 30, 2013 20 / 28
  • 29.
    Example: Anatomy ofthe long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
  • 30.
    Example: Anatomy ofthe long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Sampling? Unreliable estimates for rare groups Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
  • 31.
    Example: Anatomy ofthe long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Random access from disk? 1000x more storage, but 1000x slower2 2 Numbers every programmer should know Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
  • 32.
    Example: Anatomy ofthe long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Streaming Read data one observation at a time, storing only needed state Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
  • 33.
    The combinable group-byoperation Streaming for each observation as (group, value): if new group: initialize result update result for corresponding group as function of existing result and current value for each group: output group and result Jake Hofman (Columbia University) Intro to Counting January 30, 2013 22 / 28
  • 34.
    The combinable group-byoperation Streaming for each observation as (group, value): if new group: initialize result update result for corresponding group as function of existing result and current value for each group: output group and result Useful for computing a subset of within-group statistics with a limited memory footprint (e.g., min, mean, max, variance, etc.) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 22 / 28
  • 35.
    Example: Movielens 0 1,000,000 2,000,000 3,000,000 1 23 4 5 Rating Numberofratings for each rating: counts[movie id]++ Jake Hofman (Columbia University) Intro to Counting January 30, 2013 23 / 28
  • 36.
    Example: Movielens for eachrating: totals[movie id] += rating counts[movie id]++ for each group: totals[movie id] / counts[movie id] 1 2 3 4 5 Mean Rating by Movie Density Jake Hofman (Columbia University) Intro to Counting January 30, 2013 24 / 28
  • 37.
    Yet another group-byoperation Per-group histograms for each observation as (group, value): histogram[group][value]++ for each group: compute result as a function of histogram output group and result Jake Hofman (Columbia University) Intro to Counting January 30, 2013 25 / 28
  • 38.
    Yet another group-byoperation Per-group histograms for each observation as (group, value): histogram[group][value]++ for each group: compute result as a function of histogram output group and result We can recover arbitrary statistics if we can afford to store counts of all distinct values within in each group Jake Hofman (Columbia University) Intro to Counting January 30, 2013 25 / 28
  • 39.
    The group-by operation Forarbitrary input data: Memory Scenario Distributions Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes No No 1 Large # both No No N = total number of observations G = number of distinct groups V = largest number of distinct values within group Jake Hofman (Columbia University) Intro to Counting January 30, 2013 26 / 28
  • 40.
    Examples (w/ 8GBRAM) Median rating by movie for Netflix N ∼ 100M ratings G ∼ 20K movies V ∼ 10 half-star values V *G ∼ 200K, store per-group histograms for arbitrary statistics (scales to arbitrary N, if you’re patient) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
  • 41.
    Examples (w/ 8GBRAM) Median rating by video for YouTube N ∼ 10B ratings G ∼ 1B videos V ∼ 10 half-star values V *G ∼ 10B, fails because per-group histograms are too large to store in memory G ∼ 1B, but no (exact) calculation for streaming median Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
  • 42.
    Examples (w/ 8GBRAM) Mean rating by video for YouTube N ∼ 10B ratings G ∼ 1B videos V ∼ 10 half-star values G ∼ 1B, use streaming to compute combinable statistics Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
  • 43.
    The group-by operation Forpre-grouped input data: Memory Scenario Distributions Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes Yes General 1 Large # both No Combinable N = total number of observations G = number of distinct groups V = largest number of distinct values within group Jake Hofman (Columbia University) Intro to Counting January 30, 2013 28 / 28