SlideShare a Scribd company logo
1 of 47
Download to read offline
Counting at Scale
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
February 10, 2017
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 1 / 29
Previously
Claim:
Solving the counting problem at scale enables you to investigate
many interesting questions in the social sciences
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 2 / 29
Learning to count
Last week:
Counting at small/medium scales on a single machine
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 3 / 29
Learning to count
Last week:
Counting at small/medium scales on a single machine
This week:
Counting at large scales in parallel
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 3 / 29
What?
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 4 / 29
What?
“... to create building blocks for programmers who just
happen to have lots of data to store, lots of data to
analyze, or lots of machines to coordinate, and who
don’t have the time, the skill, or the inclination to
become distributed systems experts to build the
infrastructure to handle it.”
-Tom White
Hadoop: The Definitive Guide
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 5 / 29
What?
Hadoop contains many subprojects:
We’ll focus on distributed computation with MapReduce.
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 6 / 29
Who/when?
An overly brief history
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
Who/when?
pre-2004
Doug Cutting and Mike Cafarella develop open source projects for
web-scale indexing, crawling, and search
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
Who/when?
2004
Dean and Ghemawat publish MapReduce programming model,
used internally at Google
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
Who/when?
2006
Hadoop becomes official Apache project, Cutting joins Yahoo!,
Yahoo adopts Hadoop
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
Who/when?
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
Where?
http://wiki.apache.org/hadoop/PoweredBy
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 8 / 29
Why?
Why yet another solution?
(I already use too many languages/environments)
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 9 / 29
Why?
Why a distributed solution?
(My desktop has TBs of storage and GBs of memory)
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 9 / 29
Why?
Roughly how long to read 1TB from a commodity hard disk?
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29
Why?
Roughly how long to read 1TB from a commodity hard disk?
1
2
Gb
sec
×
1
8
B
b
× 3600
sec
hr
≈ 225
GB
hr
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29
Why?
Roughly how long to read 1TB from a commodity hard disk?1
≈ 4hrs
1
SSDs provide a ∼10x speedup
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29
Why?
http://bit.ly/petabytesort
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 11 / 29
Typical scenario
Store, parse, and analyze high-volume server logs,
e.g. how many search queries match “icwsm”?
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 12 / 29
MapReduce: 30k ft
Break large problem into smaller parts, solve in parallel, combine
results
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 13 / 29
Typical scenario
“Embarassingly parallel”
(or nearly so)
node 1
local read filter
node 2
local read filter
node 3
local read filter
node 4
local read filter
}collect results
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 14 / 29
Typical scenario++
How many search queries match “icwsm”, grouped by month?
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 15 / 29
MapReduce: example
20091201,4.2.2.1,"icwsm 2010"
20100523,2.4.1.2,"hadoop"
20100101,9.7.6.5,"tutorial"
20091125,2.4.6.1,"data"
20090708,4.2.2.1,"open source"
20100124,1.2.2.4,"washington dc"
20100522,2.4.1.2,"conference"
20091008,4.2.2.1,"2009 icwsm"
20090807,4.2.2.1,"apache.org"
20100101,9.7.6.5,"mapreduce"
20100123,1.2.2.4,"washington dc"
20091121,2.4.6.1,"icwsm dates"
20090807,4.2.2.1,"distributed"
20091225,4.2.2.1,"icwsm"
20100522,2.4.1.2,"media"
20100123,1.2.2.4,"social"
20091114,2.4.6.1,"d.c."
20100101,9.7.6.5,"new year's"
Map
matching records to
(YYYYMM, count=1)
200912, 1
200910, 1
200911, 1
200912, 1
200910, 1
...
200912, 1
200912, 1
...
200911, 1
200910, 1
...
200912, 2
...
200911, 1
Shuffle
to collect all records
w/ same key (month)
Reduce
results by adding
count values for each key
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 16 / 29
MapReduce: paradigm
Programmer specifies map and reduce functions
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
MapReduce: paradigm
Map: tranforms input record to intermediate (key, value) pair
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
MapReduce: paradigm
Shuffle: collects all intermediate records by key
Record assigned to reducers by hash(key) % num reducers
Reducers perform a merge sort to collect records with same key
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
MapReduce: paradigm
Reduce: transforms all records for given key to final output
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
MapReduce: paradigm
Distributed read, shuffle, and write are transparent to programmer
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
MapReduce: principles
• Move code to data (local computation)
• Allow programs to scale transparently w.r.t size of input
• Abstract away fault tolerance, synchronization, etc.
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 18 / 29
MapReduce: strengths
• Batch, offline jobs
• Write-once, read-many across full data set
• Usually, though not always, simple computations
• I/O bound by disk/network bandwidth
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 19 / 29
!MapReduce
What it’s not:
• High-performance parallel computing, e.g. MPI
• Low-latency random access relational database
• Always the right solution
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 20 / 29
Word count
dog 2
-- 1
the 3
brown 1
fox 2
jumped 1
lazy 2
jumps 1
over 2
quick 1
that 1
who 1
? 1
the quick brown fox
jumps over the lazy dog
who jumped over that
lazy dog -- the fox ?
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
Word count
Map: for each line, output each word and count (of 1)
the quick brown fox
--------------------------------
jumps over the lazy dog
--------------------------------
who jumped over that
--------------------------------
lazy dog -- the fox ?
the 1
quick 1
brown 1
fox 1
---------
jumps 1
over 1
the 1
lazy 1
dog 1
---------
who 1
jumped 1
over 1
---------
that 1
lazy 1
dog 1
-- 1
the 1
fox 1
? 1
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
Word count
Shuffle: collect all records for each word
the quick brown fox
--------------------------------
jumps over the lazy dog
--------------------------------
who jumped over that
--------------------------------
lazy dog -- the fox ?
-- 1
---------
? 1
---------
brown 1
---------
dog 1
dog 1
---------
fox 1
fox 1
---------
jumped 1
---------
jumps 1
---------
lazy 1
lazy 1
---------
over 1
over 1
---------
quick 1
---------
that 1
---------
the 1
the 1
the 1
---------
who 1
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
Word count
Reduce: add counts for each word
-- 1
---------
? 1
---------
brown 1
---------
dog 1
dog 1
---------
fox 1
fox 1
---------
jumped 1
---------
jumps 1
---------
lazy 1
lazy 1
---------
over 1
over 1
---------
quick 1
---------
that 1
---------
the 1
the 1
the 1
---------
who 1
-- 1
? 1
brown 1
dog 2
fox 2
jumped 1
jumps 1
lazy 2
over 2
quick 1
that 1
the 3
who 1
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
Word count
dog 1
dog 1
---------
-- 1
---------
the 1
the 1
the 1
---------
brown 1
---------
fox 1
fox 1
---------
jumped 1
---------
lazy 1
lazy 1
---------
jumps 1
---------
over 1
over 1
---------
quick 1
---------
that 1
---------
? 1
---------
who 1
dog 2
-- 1
the 3
brown 1
fox 2
jumped 1
lazy 2
jumps 1
over 2
quick 1
that 1
who 1
? 1
the quick brown fox
--------------------------------
jumps over the lazy dog
--------------------------------
who jumped over that
--------------------------------
lazy dog -- the fox ?
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
WordCount.java
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 22 / 29
Hadoop streaming
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 23 / 29
Hadoop streaming
MapReduce for *nix geeks2:
# cat data | map | sort | reduce
Where the sort is a hack to approximate a distributed group-by:
• Mapper reads input data from stdin
• Mapper writes output to stdout
• Reducer receives input, grouped by key, on stdin
• Reducer writes output to stdout
2
http://bit.ly/michaelnoll
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 24 / 29
wordcount.sh
Locally:
# cat data | tr " " "n" | sort | uniq -c
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 25 / 29
wordcount.sh
Locally:
# cat data | tr " " "n" | sort | uniq -c
⇓
Distributed:
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 25 / 29
Transparent scaling
Use the same code on MBs locally or TBs across thousands
of machines.
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 26 / 29
wordcount.py
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 27 / 29
Higher level abstractions
Higher level languages like Pig or Hive provide robust
implementations for many common MapReduce operations
(e.g., filter, sort, join, group by, etc.)
They also allow for user-defined map and reduce functions
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 28 / 29
Higher level abstractions
Pig looks like SQL, but is more procedural
pig.apache.org
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 29 / 29
Higher level abstractions
Hive is closer to SQL, with nested declarative statements
hive.apache.org
Jake Hofman (Columbia University) Counting at Scale February 10, 2017 29 / 29

More Related Content

Viewers also liked

Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overviewjakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1jakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scalejakehofman
 
Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03jakehofman
 
Bitcoin's Killer App 2017 - Sean Walsh
Bitcoin's Killer App 2017 -  Sean WalshBitcoin's Killer App 2017 -  Sean Walsh
Bitcoin's Killer App 2017 - Sean WalshSean Walsh
 
Recomendaciones integrales de política pública para las juventudes en la ar...
Recomendaciones integrales de política pública para las juventudes en la ar...Recomendaciones integrales de política pública para las juventudes en la ar...
Recomendaciones integrales de política pública para las juventudes en la ar...Jorge Roldán
 
Cuadro de los pagos tributarios
Cuadro de los pagos tributariosCuadro de los pagos tributarios
Cuadro de los pagos tributariosselvagomez2872
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...Andreas Önnerfors
 
Charitable trust ppt
Charitable trust pptCharitable trust ppt
Charitable trust pptSweety Sharma
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboralalexander_hv
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBGord Sissons
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศPetch Boonyakorn
 
Amazon alexa - building custom skills
Amazon alexa - building custom skillsAmazon alexa - building custom skills
Amazon alexa - building custom skillsAniruddha Chakrabarti
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook Total
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
Teraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationTeraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationGord Sissons
 
Blistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLBlistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLSimon Harris
 

Viewers also liked (20)

Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03
 
Bitcoin's Killer App 2017 - Sean Walsh
Bitcoin's Killer App 2017 -  Sean WalshBitcoin's Killer App 2017 -  Sean Walsh
Bitcoin's Killer App 2017 - Sean Walsh
 
Recomendaciones integrales de política pública para las juventudes en la ar...
Recomendaciones integrales de política pública para las juventudes en la ar...Recomendaciones integrales de política pública para las juventudes en la ar...
Recomendaciones integrales de política pública para las juventudes en la ar...
 
Social Media Policies
Social Media PoliciesSocial Media Policies
Social Media Policies
 
Cuadro de los pagos tributarios
Cuadro de los pagos tributariosCuadro de los pagos tributarios
Cuadro de los pagos tributarios
 
Blog
BlogBlog
Blog
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
 
Charitable trust ppt
Charitable trust pptCharitable trust ppt
Charitable trust ppt
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboral
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TB
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศ
 
Amazon alexa - building custom skills
Amazon alexa - building custom skillsAmazon alexa - building custom skills
Amazon alexa - building custom skills
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Teraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview PresentationTeraproc Application Cluster-as-a-Service Overview Presentation
Teraproc Application Cluster-as-a-Service Overview Presentation
 
CV_of_Ashish Agarwal
CV_of_Ashish AgarwalCV_of_Ashish Agarwal
CV_of_Ashish Agarwal
 
Blistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLBlistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQL
 

More from jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networksjakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classificationjakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationjakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systemsjakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayesjakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studiesjakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Sciencejakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classificationjakehofman
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regressionjakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experimentsjakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIjakehofman
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part Ijakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIjakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part Ijakehofman
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIjakehofman
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part Ijakehofman
 

More from jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part II
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part I
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part II
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part I
 

Recently uploaded

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 

Recently uploaded (20)

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 

Modeling Social Data, Lecture 4: Counting at Scale

  • 1. Counting at Scale APAM E4990 Modeling Social Data Jake Hofman Columbia University February 10, 2017 Jake Hofman (Columbia University) Counting at Scale February 10, 2017 1 / 29
  • 2. Previously Claim: Solving the counting problem at scale enables you to investigate many interesting questions in the social sciences Jake Hofman (Columbia University) Counting at Scale February 10, 2017 2 / 29
  • 3. Learning to count Last week: Counting at small/medium scales on a single machine Jake Hofman (Columbia University) Counting at Scale February 10, 2017 3 / 29
  • 4. Learning to count Last week: Counting at small/medium scales on a single machine This week: Counting at large scales in parallel Jake Hofman (Columbia University) Counting at Scale February 10, 2017 3 / 29
  • 5. What? Jake Hofman (Columbia University) Counting at Scale February 10, 2017 4 / 29
  • 6. What? “... to create building blocks for programmers who just happen to have lots of data to store, lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.” -Tom White Hadoop: The Definitive Guide Jake Hofman (Columbia University) Counting at Scale February 10, 2017 5 / 29
  • 7. What? Hadoop contains many subprojects: We’ll focus on distributed computation with MapReduce. Jake Hofman (Columbia University) Counting at Scale February 10, 2017 6 / 29
  • 8. Who/when? An overly brief history Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
  • 9. Who/when? pre-2004 Doug Cutting and Mike Cafarella develop open source projects for web-scale indexing, crawling, and search Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
  • 10. Who/when? 2004 Dean and Ghemawat publish MapReduce programming model, used internally at Google Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
  • 11. Who/when? 2006 Hadoop becomes official Apache project, Cutting joins Yahoo!, Yahoo adopts Hadoop Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
  • 12. Who/when? Jake Hofman (Columbia University) Counting at Scale February 10, 2017 7 / 29
  • 13. Where? http://wiki.apache.org/hadoop/PoweredBy Jake Hofman (Columbia University) Counting at Scale February 10, 2017 8 / 29
  • 14. Why? Why yet another solution? (I already use too many languages/environments) Jake Hofman (Columbia University) Counting at Scale February 10, 2017 9 / 29
  • 15. Why? Why a distributed solution? (My desktop has TBs of storage and GBs of memory) Jake Hofman (Columbia University) Counting at Scale February 10, 2017 9 / 29
  • 16. Why? Roughly how long to read 1TB from a commodity hard disk? Jake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29
  • 17. Why? Roughly how long to read 1TB from a commodity hard disk? 1 2 Gb sec × 1 8 B b × 3600 sec hr ≈ 225 GB hr Jake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29
  • 18. Why? Roughly how long to read 1TB from a commodity hard disk?1 ≈ 4hrs 1 SSDs provide a ∼10x speedup Jake Hofman (Columbia University) Counting at Scale February 10, 2017 10 / 29
  • 19. Why? http://bit.ly/petabytesort Jake Hofman (Columbia University) Counting at Scale February 10, 2017 11 / 29
  • 20. Typical scenario Store, parse, and analyze high-volume server logs, e.g. how many search queries match “icwsm”? Jake Hofman (Columbia University) Counting at Scale February 10, 2017 12 / 29
  • 21. MapReduce: 30k ft Break large problem into smaller parts, solve in parallel, combine results Jake Hofman (Columbia University) Counting at Scale February 10, 2017 13 / 29
  • 22. Typical scenario “Embarassingly parallel” (or nearly so) node 1 local read filter node 2 local read filter node 3 local read filter node 4 local read filter }collect results Jake Hofman (Columbia University) Counting at Scale February 10, 2017 14 / 29
  • 23. Typical scenario++ How many search queries match “icwsm”, grouped by month? Jake Hofman (Columbia University) Counting at Scale February 10, 2017 15 / 29
  • 24. MapReduce: example 20091201,4.2.2.1,"icwsm 2010" 20100523,2.4.1.2,"hadoop" 20100101,9.7.6.5,"tutorial" 20091125,2.4.6.1,"data" 20090708,4.2.2.1,"open source" 20100124,1.2.2.4,"washington dc" 20100522,2.4.1.2,"conference" 20091008,4.2.2.1,"2009 icwsm" 20090807,4.2.2.1,"apache.org" 20100101,9.7.6.5,"mapreduce" 20100123,1.2.2.4,"washington dc" 20091121,2.4.6.1,"icwsm dates" 20090807,4.2.2.1,"distributed" 20091225,4.2.2.1,"icwsm" 20100522,2.4.1.2,"media" 20100123,1.2.2.4,"social" 20091114,2.4.6.1,"d.c." 20100101,9.7.6.5,"new year's" Map matching records to (YYYYMM, count=1) 200912, 1 200910, 1 200911, 1 200912, 1 200910, 1 ... 200912, 1 200912, 1 ... 200911, 1 200910, 1 ... 200912, 2 ... 200911, 1 Shuffle to collect all records w/ same key (month) Reduce results by adding count values for each key Jake Hofman (Columbia University) Counting at Scale February 10, 2017 16 / 29
  • 25. MapReduce: paradigm Programmer specifies map and reduce functions Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
  • 26. MapReduce: paradigm Map: tranforms input record to intermediate (key, value) pair Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
  • 27. MapReduce: paradigm Shuffle: collects all intermediate records by key Record assigned to reducers by hash(key) % num reducers Reducers perform a merge sort to collect records with same key Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
  • 28. MapReduce: paradigm Reduce: transforms all records for given key to final output Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
  • 29. MapReduce: paradigm Distributed read, shuffle, and write are transparent to programmer Jake Hofman (Columbia University) Counting at Scale February 10, 2017 17 / 29
  • 30. MapReduce: principles • Move code to data (local computation) • Allow programs to scale transparently w.r.t size of input • Abstract away fault tolerance, synchronization, etc. Jake Hofman (Columbia University) Counting at Scale February 10, 2017 18 / 29
  • 31. MapReduce: strengths • Batch, offline jobs • Write-once, read-many across full data set • Usually, though not always, simple computations • I/O bound by disk/network bandwidth Jake Hofman (Columbia University) Counting at Scale February 10, 2017 19 / 29
  • 32. !MapReduce What it’s not: • High-performance parallel computing, e.g. MPI • Low-latency random access relational database • Always the right solution Jake Hofman (Columbia University) Counting at Scale February 10, 2017 20 / 29
  • 33. Word count dog 2 -- 1 the 3 brown 1 fox 2 jumped 1 lazy 2 jumps 1 over 2 quick 1 that 1 who 1 ? 1 the quick brown fox jumps over the lazy dog who jumped over that lazy dog -- the fox ? Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
  • 34. Word count Map: for each line, output each word and count (of 1) the quick brown fox -------------------------------- jumps over the lazy dog -------------------------------- who jumped over that -------------------------------- lazy dog -- the fox ? the 1 quick 1 brown 1 fox 1 --------- jumps 1 over 1 the 1 lazy 1 dog 1 --------- who 1 jumped 1 over 1 --------- that 1 lazy 1 dog 1 -- 1 the 1 fox 1 ? 1 Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
  • 35. Word count Shuffle: collect all records for each word the quick brown fox -------------------------------- jumps over the lazy dog -------------------------------- who jumped over that -------------------------------- lazy dog -- the fox ? -- 1 --------- ? 1 --------- brown 1 --------- dog 1 dog 1 --------- fox 1 fox 1 --------- jumped 1 --------- jumps 1 --------- lazy 1 lazy 1 --------- over 1 over 1 --------- quick 1 --------- that 1 --------- the 1 the 1 the 1 --------- who 1 Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
  • 36. Word count Reduce: add counts for each word -- 1 --------- ? 1 --------- brown 1 --------- dog 1 dog 1 --------- fox 1 fox 1 --------- jumped 1 --------- jumps 1 --------- lazy 1 lazy 1 --------- over 1 over 1 --------- quick 1 --------- that 1 --------- the 1 the 1 the 1 --------- who 1 -- 1 ? 1 brown 1 dog 2 fox 2 jumped 1 jumps 1 lazy 2 over 2 quick 1 that 1 the 3 who 1 Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
  • 37. Word count dog 1 dog 1 --------- -- 1 --------- the 1 the 1 the 1 --------- brown 1 --------- fox 1 fox 1 --------- jumped 1 --------- lazy 1 lazy 1 --------- jumps 1 --------- over 1 over 1 --------- quick 1 --------- that 1 --------- ? 1 --------- who 1 dog 2 -- 1 the 3 brown 1 fox 2 jumped 1 lazy 2 jumps 1 over 2 quick 1 that 1 who 1 ? 1 the quick brown fox -------------------------------- jumps over the lazy dog -------------------------------- who jumped over that -------------------------------- lazy dog -- the fox ? Jake Hofman (Columbia University) Counting at Scale February 10, 2017 21 / 29
  • 38. WordCount.java Jake Hofman (Columbia University) Counting at Scale February 10, 2017 22 / 29
  • 39. Hadoop streaming Jake Hofman (Columbia University) Counting at Scale February 10, 2017 23 / 29
  • 40. Hadoop streaming MapReduce for *nix geeks2: # cat data | map | sort | reduce Where the sort is a hack to approximate a distributed group-by: • Mapper reads input data from stdin • Mapper writes output to stdout • Reducer receives input, grouped by key, on stdin • Reducer writes output to stdout 2 http://bit.ly/michaelnoll Jake Hofman (Columbia University) Counting at Scale February 10, 2017 24 / 29
  • 41. wordcount.sh Locally: # cat data | tr " " "n" | sort | uniq -c Jake Hofman (Columbia University) Counting at Scale February 10, 2017 25 / 29
  • 42. wordcount.sh Locally: # cat data | tr " " "n" | sort | uniq -c ⇓ Distributed: Jake Hofman (Columbia University) Counting at Scale February 10, 2017 25 / 29
  • 43. Transparent scaling Use the same code on MBs locally or TBs across thousands of machines. Jake Hofman (Columbia University) Counting at Scale February 10, 2017 26 / 29
  • 44. wordcount.py Jake Hofman (Columbia University) Counting at Scale February 10, 2017 27 / 29
  • 45. Higher level abstractions Higher level languages like Pig or Hive provide robust implementations for many common MapReduce operations (e.g., filter, sort, join, group by, etc.) They also allow for user-defined map and reduce functions Jake Hofman (Columbia University) Counting at Scale February 10, 2017 28 / 29
  • 46. Higher level abstractions Pig looks like SQL, but is more procedural pig.apache.org Jake Hofman (Columbia University) Counting at Scale February 10, 2017 29 / 29
  • 47. Higher level abstractions Hive is closer to SQL, with nested declarative statements hive.apache.org Jake Hofman (Columbia University) Counting at Scale February 10, 2017 29 / 29