A Hands-on Introduction to
MapReduce in Python
David Massart, PhD
Who Am I?
Outline
• Set-up and requirements
• Counting words
• Limitations
• Map / Reduce
– Mapping
– Shuffling
– Reducing
• Hadoop
Environment Set-up
• Required
– Unix-like shell
• Linux
• Mac OS X
• Windows + Cygwin
– Python (e.g., anaconda)
• Good to have
– Java 8
– Hadoop 2.6
Moby Dick by Herman Melville
• Download Moby Dick:
wget https://www.gutenberg.org/cache/epub/2701/pg2701.txt
• Rename it input.txt:
mv pg2701.txt input.txt
cat input.txt
Counting Words
./counter.py < input.txt
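The slides run counter.py without showing its source. A minimal sketch of what such a script could look like follows; the tokenization rule (lowercase runs of letters and apostrophes), the function name count_words, and the tab-separated output format are assumptions, not the author's code.

```python
#!/usr/bin/env python3
"""counter.py: hypothetical single-process word counter (not the author's script)."""
import re
import sys
from collections import Counter

# Assumed tokenization: runs of lowercase letters and apostrophes.
WORD = re.compile(r"[a-z']+")


def count_words(lines):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(WORD.findall(line.lower()))
    return counts


if __name__ == "__main__":
    # Print one "word<TAB>count" line per word, most frequent first.
    for word, n in count_words(sys.stdin).most_common():
        print(f"{word}\t{n}")
```

The whole dictionary lives in the memory of one process, which is the limitation the next slide discusses.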
Limitations
• Processing time is, at best, proportional to the
size of the text
• In practice, performance also degrades as the
dictionary of distinct words grows
• Very large texts may not fit on a single disk
MapReduce, Part 1: Mapping
./mapper.py < input.txt
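The source of mapper.py is not included in the slides. A minimal sketch of a word-count mapper, assuming the common "word<TAB>1" line format and the same hypothetical tokenization rule as above, might be:

```python
#!/usr/bin/env python3
"""mapper.py: hypothetical word-count mapper (not the author's script)."""
import re
import sys

# Assumed tokenization: runs of lowercase letters and apostrophes.
WORD = re.compile(r"[a-z']+")


def map_words(lines):
    """Yield a (word, 1) pair for every word occurrence, in input order."""
    for line in lines:
        for word in WORD.findall(line.lower()):
            yield word, 1


if __name__ == "__main__":
    # Emit one "word<TAB>1" line per occurrence; no counting happens here.
    for word, one in map_words(sys.stdin):
        print(f"{word}\t{one}")
```

The mapper keeps no state between words, so separate chunks of the input can be mapped independently of each other.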
MapReduce, Part 2: Shuffling
• Redistribute data based on the output keys
produced by the "mapper"
• So that all data belonging to one key is
grouped together
./mapper.py < input.txt | sort
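In the pipeline above, sort plays the role of the shuffle: it places lines with the same key next to each other. The same idea can be sketched in Python; the helper name shuffle_pairs and the sample pairs are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_pairs(pairs):
    """Sort (key, value) pairs by key, then collect the values of each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


pairs = [("call", 1), ("me", 1), ("ishmael", 1), ("call", 1)]
print(shuffle_pairs(pairs))
# → [('call', [1, 1]), ('ishmael', [1]), ('me', [1])]
```

Note that groupby only merges adjacent items, which is why the sort has to come first.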
MapReduce, Part 3: Reducing
./mapper.py < input.txt | sort | ./reducer.py
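The source of reducer.py is not shown either. A minimal sketch, assuming its input is a stream of "word<TAB>count" lines already sorted by word, could sum each run of identical keys; the function names parse and reduce_counts are hypothetical.

```python
#!/usr/bin/env python3
"""reducer.py: hypothetical word-count reducer (not the author's script)."""
import sys
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Turn "word<TAB>count" lines into (word, int(count)) pairs, skipping blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, count = line.rpartition("\t")
        yield word, int(count)


def reduce_counts(pairs):
    """Sum the counts of each run of identical keys; input must be sorted by key."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    for word, total in reduce_counts(parse(sys.stdin)):
        print(f"{word}\t{total}")
```

Because the input is sorted, the reducer only needs to hold one running total at a time instead of the whole dictionary.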
Hadoop
More details available at
http://zettadatanet.wordpress.com
These slides are available at
http://www.slideshare.net/dmassart/mapreduce-20150315

