Computational Social Science, Lecture 02: An Introduction to Counting

2,930 views

Published on

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,930
On SlideShare
0
From Embeds
0
Number of Embeds
2,157
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Computational Social Science, Lecture 02: An Introduction to Counting

  1. 1. Introduction to Counting APAM E4990 Computational Social Science Jake Hofman Columbia University February 1, 2013Jake Hofman (Columbia University) Intro to Counting February 1, 2013 1 / 21
  2. 2. Why counting?Jake Hofman (Columbia University) Intro to Counting February 1, 2013 2 / 21
  3. 3. Why counting? sta·tis·tic noun 1. A function of a random sample of data 2. A functional of the distribution over a random variableJake Hofman (Columbia University) Intro to Counting February 1, 2013 3 / 21
  4. 4. Why counting? Problem: Traditionally difficult to estimate distributions due to small sample sizes or sparsityJake Hofman (Columbia University) Intro to Counting February 1, 2013 4 / 21
  5. 5. Why counting? Potential solution: Sacrifice granularity for precision (e.g., bin observations into larger groups)Jake Hofman (Columbia University) Intro to Counting February 1, 2013 4 / 21
  6. 6. Why counting? Potential solution: Develop sophisticated methods that generalize well from small samples (e.g., model distributions with parametric forms)Jake Hofman (Columbia University) Intro to Counting February 1, 2013 4 / 21
  7. 7. Why counting?Jake Hofman (Columbia University) Intro to Counting February 1, 2013 5 / 21
  8. 8. Why counting? (Partial) solution: When observations are plentiful, simply count and divide to estimate distributions with relative frequenciesJake Hofman (Columbia University) Intro to Counting February 1, 2013 6 / 21
  9. 9. Why counting? The good: Shift away from sophisticated statistical methods on small samples to simple methods on large samplesJake Hofman (Columbia University) Intro to Counting February 1, 2013 7 / 21
  10. 10. Why counting? The bad: Even simple methods (e.g., counting) are computationally challenging at large scalesJake Hofman (Columbia University) Intro to Counting February 1, 2013 7 / 21
  11. 11. Why counting? Claim: Solving the counting problem at scale enables you to investigate many interesting questions in the social sciencesJake Hofman (Columbia University) Intro to Counting February 1, 2013 7 / 21
  12. 12. Why counting?Jake Hofman (Columbia University) Intro to Counting February 1, 2013 8 / 21
  13. 13. Learning to count This week: Counting at small/medium scales on a single machineJake Hofman (Columbia University) Intro to Counting February 1, 2013 9 / 21
  14. 14. Learning to count This week: Counting at small/medium scales on a single machine Following weeks: Counting at large scales in parallelJake Hofman (Columbia University) Intro to Counting February 1, 2013 9 / 21
  15. 15. Counting, the easy way • Load dataset into memory • Arrange observations into groups of interest • Compute distributions and statistics within each groupJake Hofman (Columbia University) Intro to Counting February 1, 2013 10 / 21
  16. 16. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100MJake Hofman (Columbia University) Intro to Counting February 1, 2013 11 / 21
  17. 17. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100MJake Hofman (Columbia University) Intro to Counting February 1, 2013 11 / 21
  18. 18. Example: Movielens 3,000,000 2,000,000 Count 1,000,000 0 1 2 3 4 5 RatingJake Hofman (Columbia University) Intro to Counting February 1, 2013 12 / 21
  19. 19. Example: Movielens Density 1 2 3 4 5 Mean Rating by MovieJake Hofman (Columbia University) Intro to Counting February 1, 2013 13 / 21
  20. 20. Example: Movielens 100% 75% CDF 50% 25% 0% 0 3,000 6,000 9,000 Movie RankJake Hofman (Columbia University) Intro to Counting February 1, 2013 14 / 21
  21. 21. Example: Movielens Density 10 100 1,000 User eccentricityJake Hofman (Columbia University) Intro to Counting February 1, 2013 15 / 21
  22. 22. The group-by operation We estimate conditional distributions by grouping observations by their group identity, or key, and counting values within groups p( y | x1 , x2 , x3 , . . .) value keyJake Hofman (Columbia University) Intro to Counting February 1, 2013 16 / 21
  23. 23. The group-by operation Generic Group-By for each observation: place observation in bucket for corresponding group for each group: compute function over observations in bucket output group and resultJake Hofman (Columbia University) Intro to Counting February 1, 2013 17 / 21
  24. 24. The group-by operation Generic Group-By for each observation: place observation in bucket for corresponding group for each group: compute function over observations in bucket output group and result Useful for computing arbitrary within-group statistics when we have required memory (e.g., conditional distribution, median, etc.)Jake Hofman (Columbia University) Intro to Counting February 1, 2013 17 / 21
  25. 25. The group-by operation Combinable Group-By for each observation: update result for corresponding group, as function of current result and observation for each group: output group and resultJake Hofman (Columbia University) Intro to Counting February 1, 2013 18 / 21
  26. 26. The group-by operation Combinable Group-By for each observation: update result for corresponding group, as function of current result and observation for each group: output group and result Useful for computing a subset of within-group statistics with a limited memory footprint (e.g., min, mean, max, variance, etc.)Jake Hofman (Columbia University) Intro to Counting February 1, 2013 18 / 21
  27. 27. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory?Jake Hofman (Columbia University) Intro to Counting February 1, 2013 19 / 21
  28. 28. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Sampling? Unreliable estimates for rare groupsJake Hofman (Columbia University) Intro to Counting February 1, 2013 19 / 21
  29. 29. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Random access from disk? 1000x more storage, but 1000x slower1 1 Numbers every programmer should knowJake Hofman (Columbia University) Intro to Counting February 1, 2013 19 / 21
  30. 30. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Streaming? Read data one observation at a time, storing only needed stateJake Hofman (Columbia University) Intro to Counting February 1, 2013 19 / 21
  31. 31. The group-by operation For arbitrary input data: Memory Scenario Distributions Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes No No 1 Large # both No No N = total number of observations G = number of distinct groups V = largest number of distinct values within groupJake Hofman (Columbia University) Intro to Counting February 1, 2013 20 / 21
  32. 32. The group-by operation For pre-grouped input data: Memory Scenario Distributions Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes Yes General 1 Large # both No Combinable N = total number of observations G = number of distinct groups V = largest number of distinct values within groupJake Hofman (Columbia University) Intro to Counting February 1, 2013 21 / 21

×