4. Environment Set-up
• Required
  – Unix-like shell
    • Linux
    • Mac OS X
    • Windows + Cygwin
  – Python (e.g., Anaconda)
• Good to have
  – Java 8
  – Hadoop 2.6
5. Moby Dick by Herman Melville
• Download Moby Dick:
wget https://www.gutenberg.org/cache/epub/2701/pg2701.txt
• Rename it to input.txt:
mv pg2701.txt input.txt
9. Limitations
• Processing time is, at best, proportional to the size of the text
• In practice it is worse: performance also degrades as the dictionary grows
• Very large texts may not fit on a single disk
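For illustration, here is a minimal single-machine word count in Python. It is only a sketch: it assumes the limitations above refer to counting words in one process with an in-memory dictionary, and the function name count_words is hypothetical. Every line is read once, so the running time grows at least linearly with the text, and the whole dictionary has to fit in memory on one machine.

```python
from collections import Counter

def count_words(lines):
    """Count whitespace-separated words, one line at a time (hypothetical helper)."""
    counts = Counter()  # the "dictionary": one entry per distinct word
    for line in lines:
        # Lowercase so "Call" and "call" share an entry; punctuation is not stripped.
        counts.update(line.lower().split())
    return counts

# Small in-memory sample standing in for the text file.
sample = ["Call me Ishmael", "call me"]
counts = count_words(sample)
print(counts.most_common(2))
```

An open file handle can be passed in place of the sample list, for example count_words(open("input.txt", encoding="utf-8")), because iterating over a file yields its lines.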
12. MapReduce, Part 2: Shuffling
• Redistribute data based on the output keys produced by the "mapper"
• So that all data belonging to one key is grouped together
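The grouping step can be sketched in plain Python. This is a single-process illustration under stated assumptions, not how Hadoop implements shuffling: the mapper and shuffle functions are hypothetical names, and the mapper is assumed to emit (word, 1) pairs as in a word count.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every whitespace-separated word (assumed mapper)."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group mapper output by key so that all values for one key sit together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Return the groups ordered by key, so the output is deterministic.
    return sorted(groups.items())

lines = ["Call me Ishmael", "call me"]
pairs = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(pairs)
print(grouped)
```

After shuffling, each key appears exactly once with the full list of its values, which is the form a reducer would need in order to sum the counts per word.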