Computational Social Science, Lecture 03: Counting at Scale, Part I

1. Counting @ Scale. Sharad Goel, Columbia University. Computational Social Science: Lecture 3, February 8, 2013
2. Descriptive statistics (as opposed to inferential statistics) is about counting: contingency tables; means, variances, quantiles; summaries of conditional distributions
3. Long tail: video consumption on YouTube. Digital divide: time spent across various online properties. Viral diffusion: propagation of tweets on Twitter
4. Counting @ scale: conceptually easy, computationally hard
5. I/O bound: difficult to read terabytes of data. Network bound: hard to transfer terabytes of data. Memory bound: cannot randomly access data points. CPU bound: even simple manipulations add up
6. Rank videos by popularity. Local video store: 1K movies, 100K viewings
7. Rank videos by popularity. Local video store: 1K movies, 100K viewings. Load dataset into memory
8. Rank videos by popularity. Netflix: 100K movies, 1B viewings
9. Rank videos by popularity. Netflix: 100K movies, 1B viewings. Store a counter for each movie in memory and stream through the dataset
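
The streaming approach on slide 9 might look roughly like the sketch below. It assumes the viewing log arrives as an iterable of movie IDs, one per viewing; the function name `rank_by_popularity` and the sample IDs are hypothetical, not taken from the lecture.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_popularity(viewings: Iterable[str]) -> List[Tuple[str, int]]:
    """Stream through the viewings once, keeping one counter per movie."""
    counts: Counter = Counter()
    for movie_id in viewings:
        # Only the per-movie counters live in memory, never the full log.
        counts[movie_id] += 1
    # most_common() returns (movie, count) pairs, most viewed first.
    return counts.most_common()


# Hypothetical log: an iterator stands in for a file read line by line.
sample_viewings = iter(["m1", "m2", "m1", "m3", "m1", "m2"])
ranking = rank_by_popularity(sample_viewings)
print(ranking)
```

Memory use grows with the number of distinct movies rather than the number of viewings, which is why this works at the Netflix scale on the slide (100K counters) even though the log itself has 1B entries.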
10. Rank videos by popularity. YouTube: 10B videos, 10T viewings
11. Rank videos by popularity. YouTube: 10B videos, 10T viewings. Trouble, with a capital ‘T’
12. Parallel computation: distribute work across several machines
13. 10 parallel workers: 1T views per worker, maybe 5B unique videos on each. 100 parallel workers: 100B views per worker, maybe 1B videos on each
14. split → count → sort by video → merge sort by popularity
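
One way to read the pipeline on slide 14 is sketched below on a single machine, with list slices standing in for workers. It assumes the last step is a merge of the per-worker results followed by a sort by popularity (the transcript does not mark that boundary clearly). All function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import groupby
from operator import itemgetter


def count_chunk(chunk):
    """Count: tally one worker's views, then sort the tallies by video ID."""
    return sorted(Counter(chunk).items())


def merge_counts(sorted_partials):
    """Merge: combine per-worker lists that are each sorted by video ID."""
    # Tuples compare by video ID first, so equal IDs end up adjacent.
    merged = heapq.merge(*sorted_partials)
    return [
        (video, sum(count for _, count in group))
        for video, group in groupby(merged, key=itemgetter(0))
    ]


def rank(viewings, num_workers):
    # Split: deal the views out round-robin, so one video can land on many workers.
    chunks = [viewings[i::num_workers] for i in range(num_workers)]
    partials = [count_chunk(chunk) for chunk in chunks]
    totals = merge_counts(partials)
    # Sort by popularity: highest count first, ties broken by video ID.
    return sorted(totals, key=lambda pair: (-pair[1], pair[0]))


ranking = rank(["a", "b", "a", "c", "b", "a"], num_workers=2)
print(ranking)
```

Sorting each worker's output by video ID is what makes the merge cheap: the merged stream can be read once, summing counts for each video as its entries arrive next to each other.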
15. Core problem: the same movie appears on multiple machines. Solution: do not split viewing data at random; ensure individual movies are never split apart
16. split → count → sort by video → merge sort by popularity
17. Shuffle (1st attempt): create a new file for every movie; append viewing data to the appropriate file
18. Shuffle (2nd attempt): the first time you see a movie, append it randomly to one of 10K files; the next time you see the movie, append it to the same file
19. Shuffle (3rd attempt): hash the movie ID to determine which file to append it to (a hash function maps a large input space to a small output space approximately uniformly)
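
The hash-based shuffle on slide 19 could be sketched as follows. In-memory lists stand in for the output files, and MD5 is used only as a convenient stable hash (Python's built-in `hash()` for strings varies between processes); both choices are assumptions for illustration, as are the function names.

```python
import hashlib
from collections import defaultdict


def bucket_for(movie_id: str, num_buckets: int) -> int:
    """Map a movie ID to a bucket number in [0, num_buckets)."""
    digest = hashlib.md5(movie_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce modulo the bucket count.
    return int.from_bytes(digest[:8], "big") % num_buckets


def shuffle_by_hash(viewings, num_buckets):
    """Append each viewing to the bucket chosen by hashing its movie ID."""
    buckets = defaultdict(list)
    for movie_id in viewings:
        buckets[bucket_for(movie_id, num_buckets)].append(movie_id)
    return buckets


buckets = shuffle_by_hash(["m1", "m2", "m1", "m3", "m1"], num_buckets=4)
for bucket_id, movie_ids in sorted(buckets.items()):
    print(bucket_id, movie_ids)
```

Unlike the second attempt, nothing needs to remember where a movie was sent: the bucket is recomputed from the ID each time, so every viewing of a movie lands in the same place.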
20. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004
21. Map: assign each input line to one or more groups. Shuffle: aggregate groups. Reduce: operate on grouped data
22. Map: assign each input line to one or more groups, v → [(k1, v1), …, (km, vm)]. Shuffle: aggregate groups. Reduce: operate on grouped data, (k, [v1, …, vn]) → [w1, …, wp]
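
The three signatures on slide 22 can be mirrored in a toy, single-process sketch. This is not the distributed system described in the Dean and Ghemawat paper; `map_reduce`, `mapper`, and `reducer` are hypothetical names chosen to match the slide's notation.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Toy in-memory version of the map, shuffle, and reduce phases."""
    # Map: each input v yields a list of (k, v) pairs.
    # Shuffle: pairs with the same key are collected into one group.
    groups = defaultdict(list)
    for value in inputs:
        for key, mapped_value in mapper(value):
            groups[key].append(mapped_value)
    # Reduce: each (k, [v1, ..., vn]) group yields a list [w1, ..., wp].
    return {key: list(reducer(key, values)) for key, values in groups.items()}


# Example use: count views per video, with one log entry per viewing.
views = ["a", "b", "a"]
view_counts = map_reduce(
    views,
    mapper=lambda video: [(video, 1)],
    reducer=lambda video, ones: [sum(ones)],
)
print(view_counts)
```

Only the mapper and reducer change from task to task; the grouping step in the middle stays the same.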
23. The Insight of MapReduce: one can efficiently group identical items; many tasks are computationally easier on grouped data
24. Word Count. Input: text corpus. Output: number of occurrences of each word
25. Word Count. Map: line → words. Reduce: word group → size of group
26. MapReduce: the unreasonable effectiveness of aggregation
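
The word count from slides 24 and 25 can be sketched in a few lines. This assumes words are separated by whitespace and compared case-insensitively, which the slides do not specify; the function names and the sample corpus are hypothetical.

```python
from collections import defaultdict


def map_line(line):
    """Map: line -> words, emitted as (word, 1) pairs."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: collect the values emitted for each word into one group."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups


def reduce_group(word, ones):
    """Reduce: word group -> size of group."""
    return len(ones)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_line(line))
    return {word: reduce_group(word, ones) for word, ones in shuffle(pairs).items()}


corpus = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(corpus)
print(counts)
```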