SlideShare a Scribd company logo
Introduction to Counting
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
January 30, 2013
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 1 / 28
Why counting?
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 2 / 28
Why counting?
sta·tis·tic
noun
1. A function of a random sample of data
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 3 / 28
Why counting?
p( y
support
| x
age
)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 4 / 28
Why counting?
?p( y
support
| x1, x2, x3, . . .
age, sex, race, party
)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 4 / 28
Why counting?
Problem:
Traditionally difficult to obtain reliable estimates due to small
sample sizes or sparsity
(e.g., ∼ 100 age × 2 sex × 5 race × 3 party = 3000 groups,
but typical surveys collect ∼ 1,000s of responses)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
Why counting?
Potential solution:
Sacrifice granularity for precision, by binning observations into
larger, but fewer, groups
(e.g., bin age into a few groups: 18-29, 30-49, 50-64, 65+)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
Why counting?
Potential solution:
Develop more sophisticated methods that generalize well from
small samples
(e.g., fit a model: support ∼ β0 + β1age + β2age2 + . . .)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
Why counting?
(Partial) solution:
Obtain larger samples through other means, so we can just count
and divide to make estimates via relative frequencies
(e.g., with ∼ 1M responses, we have 100s per group and can
estimate support within a few percentage points)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 6 / 28
Why counting?
The good:
Shift away from sophisticated statistical methods on small samples
to simple methods on large samples
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
Why counting?
The bad:
Even simple methods (e.g., counting) are computationally
challenging at large scales
(1M is easy, 1B a bit less so, 1T gets interesting)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
Why counting?
Claim:
Solving the counting problem at scale enables you to investigate
many interesting questions in the social sciences
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
Learning to count
This week:
Counting at small/medium scales on a single machine
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 8 / 28
Learning to count
This week:
Counting at small/medium scales on a single machine
Following weeks:
Counting at large scales in parallel
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 8 / 28
Counting, the easy way
Split / Apply / Combine1
http://bit.ly/sactutorial
• Load dataset into memory
• Split: Arrange observations into groups of interest
• Apply: Compute distributions and statistics within each group
• Combine: Collect results across groups
1
http://bit.ly/splitapplycombine
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 9 / 28
The generic group-by operation
Split / Apply / Combine
for each observation as (group, value):
place value in bucket for corresponding group
for each group:
apply a function over values in bucket
output group and result
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 10 / 28
The generic group-by operation
Split / Apply / Combine
for each observation as (group, value):
place value in bucket for corresponding group
for each group:
apply a function over values in bucket
output group and result
Useful for computing arbitrary within-group statistics when we
have required memory
(e.g., conditional distribution, median, etc.)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 10 / 28
Why counting?
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 11 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 12 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 12 / 28
Example: Movielens
0
1,000,000
2,000,000
3,000,000
1 2 3 4 5
Rating
Numberofratings
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 13 / 28
Example: Movielens
0
1,000,000
2,000,000
3,000,000
1 2 3 4 5
Rating
Numberofratings
group by rating value
for each group:
count # ratings
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 14 / 28
Example: Movielens
1 2 3 4 5
Mean Rating by Movie
Density
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 15 / 28
Example: Movielens
group by movie id
for each group:
compute average rating
1 2 3 4 5
Mean Rating by Movie
Density
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 16 / 28
Example: Movielens
0%
25%
50%
75%
100%
0 3,000 6,000 9,000
Movie Rank
CDF
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 17 / 28
Example: Movielens
0%
25%
50%
75%
100%
0 3,000 6,000 9,000
Movie Rank
CDF
group by movie id
for each group:
count # ratings
sort by group size
cumulatively sum group sizes
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 18 / 28
Example: Movielens
0
2,000
4,000
6,000
8,000
100 10,000
User eccentricity
Numberofusers
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 19 / 28
Example: Movielens
join movie ranks to ratings
group by user id
for each group:
compute median movie rank
0
2,000
4,000
6,000
8,000
100 10,000
User eccentricity
Numberofusers
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 20 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
What do we do when the full dataset exceeds available memory?
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
What do we do when the full dataset exceeds available memory?
Sampling?
Unreliable estimates for rare groups
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
What do we do when the full dataset exceeds available memory?
Random access from disk?
1000x more storage, but 1000x slower2
2
Numbers every programmer should know
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
Example: Anatomy of the long tail
Dataset Users Items Rating levels Observations
Movielens 100K 10K 10 10M
Netflix 500K 20K 5 100M
What do we do when the full dataset exceeds available memory?
Streaming
Read data one observation at a time, storing only needed state
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
The combinable group-by operation
Streaming
for each observation as (group, value):
if new group:
initialize result
update result for corresponding group as function of
existing result and current value
for each group:
output group and result
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 22 / 28
The combinable group-by operation
Streaming
for each observation as (group, value):
if new group:
initialize result
update result for corresponding group as function of
existing result and current value
for each group:
output group and result
Useful for computing a subset of within-group statistics with a
limited memory footprint
(e.g., min, mean, max, variance, etc.)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 22 / 28
Example: Movielens
0
1,000,000
2,000,000
3,000,000
1 2 3 4 5
Rating
Numberofratings
for each rating:
counts[movie id]++
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 23 / 28
Example: Movielens
for each rating:
totals[movie id] += rating
counts[movie id]++
for each group:
totals[movie id] /
counts[movie id]
1 2 3 4 5
Mean Rating by Movie
Density
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 24 / 28
Yet another group-by operation
Per-group histograms
for each observation as (group, value):
histogram[group][value]++
for each group:
compute result as a function of histogram
output group and result
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 25 / 28
Yet another group-by operation
Per-group histograms
for each observation as (group, value):
histogram[group][value]++
for each group:
compute result as a function of histogram
output group and result
We can recover arbitrary statistics if we can afford to store counts
of all distinct values within in each group
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 25 / 28
The group-by operation
For arbitrary input data:
Memory Scenario Distributions Statistics
N Small dataset Yes General
V*G Small distributions Yes General
G Small # groups No Combinable
V Small # outcomes No No
1 Large # both No No
N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 26 / 28
Examples (w/ 8GB RAM)
Median rating by movie for Netflix
N ∼ 100M ratings
G ∼ 20K movies
V ∼ 10 half-star values
V *G ∼ 200K, store per-group histograms for arbitrary statistics
(scales to arbitrary N, if you’re patient)
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
Examples (w/ 8GB RAM)
Median rating by video for YouTube
N ∼ 10B ratings
G ∼ 1B videos
V ∼ 10 half-star values
V *G ∼ 10B, fails because per-group histograms are too large to
store in memory
G ∼ 1B, but no (exact) calculation for streaming median
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
Examples (w/ 8GB RAM)
Mean rating by video for YouTube
N ∼ 10B ratings
G ∼ 1B videos
V ∼ 10 half-star values
G ∼ 1B, use streaming to compute combinable statistics
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
The group-by operation
For pre-grouped input data:
Memory Scenario Distributions Statistics
N Small dataset Yes General
V*G Small distributions Yes General
G Small # groups No Combinable
V Small # outcomes Yes General
1 Large # both No Combinable
N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group
Jake Hofman (Columbia University) Intro to Counting January 30, 2013 28 / 28

More Related Content

Viewers also liked

Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIjakehofman
 
Internet messesage
Internet messesageInternet messesage
Internet messesagebearobo
 
Butchart gardens, canada
Butchart gardens, canadaButchart gardens, canada
Butchart gardens, canadafilipj2000
 
Mexico by john2
Mexico by john2  Mexico by john2
Mexico by john2
filipj2000
 
Прориси
ПрорисиПрориси
Прорисиbelyavsky
 
Aloha from hawaii
Aloha from hawaiiAloha from hawaii
Aloha from hawaiifilipj2000
 
Conselho de classe - Pedagogo César Tavares
Conselho de classe - Pedagogo César TavaresConselho de classe - Pedagogo César Tavares
Conselho de classe - Pedagogo César Tavares
CÉSAR TAVARES
 
Building a Successful Facebook Marketing Program
Building a Successful Facebook Marketing ProgramBuilding a Successful Facebook Marketing Program
Building a Successful Facebook Marketing Program
Internet Gal
 
SEPM Outsourcing
SEPM OutsourcingSEPM Outsourcing
SEPM Outsourcing
asherad
 
one
oneone
What pilotssee
What pilotsseeWhat pilotssee
What pilotsseefilipj2000
 
Bca 2010 draft
Bca 2010 draftBca 2010 draft
Bca 2010 draftk20eagan74
 
Introduction to Steens Furniture
Introduction to Steens FurnitureIntroduction to Steens Furniture
Introduction to Steens Furniture
Steens Furniture
 
K state candidacy presentation
K state candidacy presentationK state candidacy presentation
K state candidacy presentationTeague Allen
 
Vatican museum
Vatican museumVatican museum
Vatican museumfilipj2000
 

Viewers also liked (20)

Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part II
 
Internet messesage
Internet messesageInternet messesage
Internet messesage
 
Butchart gardens, canada
Butchart gardens, canadaButchart gardens, canada
Butchart gardens, canada
 
Mexico by john2
Mexico by john2  Mexico by john2
Mexico by john2
 
Прориси
ПрорисиПрориси
Прориси
 
Shanxi china
Shanxi  chinaShanxi  china
Shanxi china
 
Aloha from hawaii
Aloha from hawaiiAloha from hawaii
Aloha from hawaii
 
Conselho de classe - Pedagogo César Tavares
Conselho de classe - Pedagogo César TavaresConselho de classe - Pedagogo César Tavares
Conselho de classe - Pedagogo César Tavares
 
La nuit
La nuitLa nuit
La nuit
 
Building a Successful Facebook Marketing Program
Building a Successful Facebook Marketing ProgramBuilding a Successful Facebook Marketing Program
Building a Successful Facebook Marketing Program
 
SEPM Outsourcing
SEPM OutsourcingSEPM Outsourcing
SEPM Outsourcing
 
Death
DeathDeath
Death
 
one
oneone
one
 
What pilotssee
What pilotsseeWhat pilotssee
What pilotssee
 
Bca 2010 draft
Bca 2010 draftBca 2010 draft
Bca 2010 draft
 
Introduction to Steens Furniture
Introduction to Steens FurnitureIntroduction to Steens Furniture
Introduction to Steens Furniture
 
K state candidacy presentation
K state candidacy presentationK state candidacy presentation
K state candidacy presentation
 
Fund for a Healthy Maine PPT, July 2011
Fund for a Healthy Maine PPT, July 2011Fund for a Healthy Maine PPT, July 2011
Fund for a Healthy Maine PPT, July 2011
 
Madrid `in tanitimi
Madrid `in tanitimiMadrid `in tanitimi
Madrid `in tanitimi
 
Vatican museum
Vatican museumVatican museum
Vatican museum
 

Similar to Modeling Social Data, Lecture 2: Introduction to Counting

Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Countingjakehofman
 
Schuster how to_avoid_mistakes_with_statistics_31052013
Schuster how to_avoid_mistakes_with_statistics_31052013Schuster how to_avoid_mistakes_with_statistics_31052013
Schuster how to_avoid_mistakes_with_statistics_31052013
Thomas_Schuster
 
Lecture 2 practical_guidelines_assignment
Lecture 2 practical_guidelines_assignmentLecture 2 practical_guidelines_assignment
Lecture 2 practical_guidelines_assignmentDaria Bogdanova
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regressionjakehofman
 
Year 9 Stats
Year 9 StatsYear 9 Stats
Year 9 Stats
WaihiCollege
 
Introduction To Statistics
Introduction To StatisticsIntroduction To Statistics
Introduction To Statisticsalbertlaporte
 
Basic statistics
Basic statisticsBasic statistics
SocialCom 2013
SocialCom 2013SocialCom 2013
SocialCom 2013
Gabriele Tolomei
 
Mean and median
Mean and medianMean and median
Mean and median
Gautam Kumar
 
Co-clustering with augmented data
Co-clustering with augmented dataCo-clustering with augmented data
Co-clustering with augmented data
AllenWu
 
2.7.21 sampling methods data analysis
2.7.21 sampling methods data analysis2.7.21 sampling methods data analysis
2.7.21 sampling methods data analysis
Ashish965416
 
Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02jakehofman
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Parth Khare
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
Abhishek Vijayvargia
 
Penggambaran Data dengan Grafik
Penggambaran Data dengan GrafikPenggambaran Data dengan Grafik
Penggambaran Data dengan Grafik
anom0164
 
Classification
ClassificationClassification
Classification
Amit Kumar Rathi
 
Lab 1 intro
Lab 1 introLab 1 intro
Lab 1 intro
Erik D. Davenport
 
Introduction to Statistics (Part -I)
Introduction to Statistics (Part -I)Introduction to Statistics (Part -I)
Introduction to Statistics (Part -I)
YesAnalytics
 
Week11-EvaluationMethods.ppt
Week11-EvaluationMethods.pptWeek11-EvaluationMethods.ppt
Week11-EvaluationMethods.ppt
KamranAli649587
 
How to Choose a Sample for Your Thesis or Dissertation
How to Choose a Sample for Your Thesis or DissertationHow to Choose a Sample for Your Thesis or Dissertation
How to Choose a Sample for Your Thesis or Dissertation
Maria Sanchez
 

Similar to Modeling Social Data, Lecture 2: Introduction to Counting (20)

Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Counting
 
Schuster how to_avoid_mistakes_with_statistics_31052013
Schuster how to_avoid_mistakes_with_statistics_31052013Schuster how to_avoid_mistakes_with_statistics_31052013
Schuster how to_avoid_mistakes_with_statistics_31052013
 
Lecture 2 practical_guidelines_assignment
Lecture 2 practical_guidelines_assignmentLecture 2 practical_guidelines_assignment
Lecture 2 practical_guidelines_assignment
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Year 9 Stats
Year 9 StatsYear 9 Stats
Year 9 Stats
 
Introduction To Statistics
Introduction To StatisticsIntroduction To Statistics
Introduction To Statistics
 
Basic statistics
Basic statisticsBasic statistics
Basic statistics
 
SocialCom 2013
SocialCom 2013SocialCom 2013
SocialCom 2013
 
Mean and median
Mean and medianMean and median
Mean and median
 
Co-clustering with augmented data
Co-clustering with augmented dataCo-clustering with augmented data
Co-clustering with augmented data
 
2.7.21 sampling methods data analysis
2.7.21 sampling methods data analysis2.7.21 sampling methods data analysis
2.7.21 sampling methods data analysis
 
Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
 
Penggambaran Data dengan Grafik
Penggambaran Data dengan GrafikPenggambaran Data dengan Grafik
Penggambaran Data dengan Grafik
 
Classification
ClassificationClassification
Classification
 
Lab 1 intro
Lab 1 introLab 1 intro
Lab 1 intro
 
Introduction to Statistics (Part -I)
Introduction to Statistics (Part -I)Introduction to Statistics (Part -I)
Introduction to Statistics (Part -I)
 
Week11-EvaluationMethods.ppt
Week11-EvaluationMethods.pptWeek11-EvaluationMethods.ppt
Week11-EvaluationMethods.ppt
 
How to Choose a Sample for Your Thesis or Dissertation
How to Choose a Sample for Your Thesis or DissertationHow to Choose a Sample for Your Thesis or Dissertation
How to Choose a Sample for Your Thesis or Dissertation
 

More from jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
jakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
jakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
jakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
jakehofman
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
jakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
jakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
jakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
jakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
jakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
jakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
jakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
jakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classificationjakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experimentsjakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
jakehofman
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part Ijakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIjakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part Ijakehofman
 

More from jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part I
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
 

Recently uploaded

PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
AADYARAJPANDEY1
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
ossaicprecious19
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
kumarmathi863
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SELF-EXPLANATORY
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
plant biotechnology Lecture note ppt.pptx
plant biotechnology Lecture note ppt.pptxplant biotechnology Lecture note ppt.pptx
plant biotechnology Lecture note ppt.pptx
yusufzako14
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
muralinath2
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
muralinath2
 
insect morphology and physiology of insect
insect morphology and physiology of insectinsect morphology and physiology of insect
insect morphology and physiology of insect
anitaento25
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
muralinath2
 

Recently uploaded (20)

PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
plant biotechnology Lecture note ppt.pptx
plant biotechnology Lecture note ppt.pptxplant biotechnology Lecture note ppt.pptx
plant biotechnology Lecture note ppt.pptx
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
insect morphology and physiology of insect
insect morphology and physiology of insectinsect morphology and physiology of insect
insect morphology and physiology of insect
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 

Modeling Social Data, Lecture 2: Introduction to Counting

  • 1. Introduction to Counting APAM E4990 Modeling Social Data Jake Hofman Columbia University January 30, 2013 Jake Hofman (Columbia University) Intro to Counting January 30, 2013 1 / 28
  • 2. Why counting? Jake Hofman (Columbia University) Intro to Counting January 30, 2013 2 / 28
  • 3. Why counting? sta·tis·tic noun 1. A function of a random sample of data Jake Hofman (Columbia University) Intro to Counting January 30, 2013 3 / 28
  • 4. Why counting? p( y support | x age ) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 4 / 28
  • 5. Why counting? ?p( y support | x1, x2, x3, . . . age, sex, race, party ) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 4 / 28
  • 6. Why counting? Problem: Traditionally difficult to obtain reliable estimates due to small sample sizes or sparsity (e.g., ∼ 100 age × 2 sex × 5 race × 3 party = 3000 groups, but typical surveys collect ∼ 1,000s of responses) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
  • 7. Why counting? Potential solution: Sacrifice granularity for precision, by binning observations into larger, but fewer, groups (e.g., bin age into a few groups: 18-29, 30-49, 50-64, 65+) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
  • 8. Why counting? Potential solution: Develop more sophisticated methods that generalize well from small samples (e.g., fit a model: support ∼ β0 + β1age + β2age2 + . . .) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 5 / 28
  • 9. Why counting? (Partial) solution: Obtain larger samples through other means, so we can just count and divide to make estimates via relative frequencies (e.g., with ∼ 1M responses, we have 100s per group and can estimate support within a few percentage points) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 6 / 28
  • 10. Why counting? The good: Shift away from sophisticated statistical methods on small samples to simple methods on large samples Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
  • 11. Why counting? The bad: Even simple methods (e.g., counting) are computationally challenging at large scales (1M is easy, 1B a bit less so, 1T gets interesting) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
  • 12. Why counting? Claim: Solving the counting problem at scale enables you to investigate many interesting questions in the social sciences Jake Hofman (Columbia University) Intro to Counting January 30, 2013 7 / 28
  • 13. Learning to count This week: Counting at small/medium scales on a single machine Jake Hofman (Columbia University) Intro to Counting January 30, 2013 8 / 28
  • 14. Learning to count This week: Counting at small/medium scales on a single machine Following weeks: Counting at large scales in parallel Jake Hofman (Columbia University) Intro to Counting January 30, 2013 8 / 28
  • 15. Counting, the easy way Split / Apply / Combine1 http://bit.ly/sactutorial • Load dataset into memory • Split: Arrange observations into groups of interest • Apply: Compute distributions and statistics within each group • Combine: Collect results across groups 1 http://bit.ly/splitapplycombine Jake Hofman (Columbia University) Intro to Counting January 30, 2013 9 / 28
  • 16. The generic group-by operation Split / Apply / Combine for each observation as (group, value): place value in bucket for corresponding group for each group: apply a function over values in bucket output group and result Jake Hofman (Columbia University) Intro to Counting January 30, 2013 10 / 28
  • 17. The generic group-by operation Split / Apply / Combine for each observation as (group, value): place value in bucket for corresponding group for each group: apply a function over values in bucket output group and result Useful for computing arbitrary within-group statistics when we have required memory (e.g., conditional distribution, median, etc.) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 10 / 28
  • 18. Why counting? Jake Hofman (Columbia University) Intro to Counting January 30, 2013 11 / 28
  • 19. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M Jake Hofman (Columbia University) Intro to Counting January 30, 2013 12 / 28
  • 20. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M Jake Hofman (Columbia University) Intro to Counting January 30, 2013 12 / 28
  • 21. Example: Movielens 0 1,000,000 2,000,000 3,000,000 1 2 3 4 5 Rating Numberofratings Jake Hofman (Columbia University) Intro to Counting January 30, 2013 13 / 28
  • 22. Example: Movielens 0 1,000,000 2,000,000 3,000,000 1 2 3 4 5 Rating Numberofratings group by rating value for each group: count # ratings Jake Hofman (Columbia University) Intro to Counting January 30, 2013 14 / 28
  • 23. Example: Movielens 1 2 3 4 5 Mean Rating by Movie Density Jake Hofman (Columbia University) Intro to Counting January 30, 2013 15 / 28
  • 24. Example: Movielens group by movie id for each group: compute average rating 1 2 3 4 5 Mean Rating by Movie Density Jake Hofman (Columbia University) Intro to Counting January 30, 2013 16 / 28
  • 25. Example: Movielens 0% 25% 50% 75% 100% 0 3,000 6,000 9,000 Movie Rank CDF Jake Hofman (Columbia University) Intro to Counting January 30, 2013 17 / 28
  • 26. Example: Movielens 0% 25% 50% 75% 100% 0 3,000 6,000 9,000 Movie Rank CDF group by movie id for each group: count # ratings sort by group size cumulatively sum group sizes Jake Hofman (Columbia University) Intro to Counting January 30, 2013 18 / 28
  • 27. Example: Movielens 0 2,000 4,000 6,000 8,000 100 10,000 User eccentricity Numberofusers Jake Hofman (Columbia University) Intro to Counting January 30, 2013 19 / 28
  • 28. Example: Movielens join movie ranks to ratings group by user id for each group: compute median movie rank 0 2,000 4,000 6,000 8,000 100 10,000 User eccentricity Numberofusers Jake Hofman (Columbia University) Intro to Counting January 30, 2013 20 / 28
  • 29. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
  • 30. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Sampling? Unreliable estimates for rare groups Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
  • 31. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Random access from disk? 1000x more storage, but 1000x slower2 2 Numbers every programmer should know Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
  • 32. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Streaming Read data one observation at a time, storing only needed state Jake Hofman (Columbia University) Intro to Counting January 30, 2013 21 / 28
  • 33. The combinable group-by operation Streaming for each observation as (group, value): if new group: initialize result update result for corresponding group as function of existing result and current value for each group: output group and result Jake Hofman (Columbia University) Intro to Counting January 30, 2013 22 / 28
  • 34. The combinable group-by operation Streaming for each observation as (group, value): if new group: initialize result update result for corresponding group as function of existing result and current value for each group: output group and result Useful for computing a subset of within-group statistics with a limited memory footprint (e.g., min, mean, max, variance, etc.) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 22 / 28
  • 35. Example: Movielens 0 1,000,000 2,000,000 3,000,000 1 2 3 4 5 Rating Numberofratings for each rating: counts[movie id]++ Jake Hofman (Columbia University) Intro to Counting January 30, 2013 23 / 28
  • 36. Example: Movielens for each rating: totals[movie id] += rating counts[movie id]++ for each group: totals[movie id] / counts[movie id] 1 2 3 4 5 Mean Rating by Movie Density Jake Hofman (Columbia University) Intro to Counting January 30, 2013 24 / 28
  • 37. Yet another group-by operation Per-group histograms for each observation as (group, value): histogram[group][value]++ for each group: compute result as a function of histogram output group and result Jake Hofman (Columbia University) Intro to Counting January 30, 2013 25 / 28
  • 38. Yet another group-by operation Per-group histograms for each observation as (group, value): histogram[group][value]++ for each group: compute result as a function of histogram output group and result We can recover arbitrary statistics if we can afford to store counts of all distinct values within in each group Jake Hofman (Columbia University) Intro to Counting January 30, 2013 25 / 28
  • 39. The group-by operation For arbitrary input data: Memory Scenario Distributions Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes No No 1 Large # both No No N = total number of observations G = number of distinct groups V = largest number of distinct values within group Jake Hofman (Columbia University) Intro to Counting January 30, 2013 26 / 28
  • 40. Examples (w/ 8GB RAM) Median rating by movie for Netflix N ∼ 100M ratings G ∼ 20K movies V ∼ 10 half-star values V *G ∼ 200K, store per-group histograms for arbitrary statistics (scales to arbitrary N, if you’re patient) Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
  • 41. Examples (w/ 8GB RAM) Median rating by video for YouTube N ∼ 10B ratings G ∼ 1B videos V ∼ 10 half-star values V *G ∼ 10B, fails because per-group histograms are too large to store in memory G ∼ 1B, but no (exact) calculation for streaming median Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
  • 42. Examples (w/ 8GB RAM) Mean rating by video for YouTube N ∼ 10B ratings G ∼ 1B videos V ∼ 10 half-star values G ∼ 1B, use streaming to compute combinable statistics Jake Hofman (Columbia University) Intro to Counting January 30, 2013 27 / 28
  • 43. The group-by operation For pre-grouped input data: Memory Scenario Distributions Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes Yes General 1 Large # both No Combinable N = total number of observations G = number of distinct groups V = largest number of distinct values within group Jake Hofman (Columbia University) Intro to Counting January 30, 2013 28 / 28