Hands on Hadoop


My intro talk on Hadoop and how to use it from Python via Hadoop Streaming.

Code is here: http://github.com/ptarjan/hands-on-hadoop-tutorial/

Published in: Education, Technology

  1. Paul Tarjan, Chief Technical Monkey, Yahoo! http://paulisageek.com
  2. Data is ugly
     - 1 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0
     - 29 08bade48-1081-488f-b459-6c75d75312ae 3 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
     - 29 08bade48-1081-488f-b459-6c75d75312ae 2 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
     - 11 08bade48-1081-488f-b459-6c75d75312ae 1 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 2.0 0.0
  3. Summaries are OK
  4. Graphs are pretty
  5. Word count, yay!
     - Given a document, how many of each word are there?
     - Real world:
       - Given our search logs, how many people click on result 1?
       - Given our Flickr photos, how many cat photos are there?
       - Given our web crawl, what are the 10 most popular words?
  6. Map/Reduce
  7. Hadoop streaming
     - Basically, write some code that runs like:
     - $ cat data | mapper.py | sort | reducer.py
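To make the `cat data | mapper.py | sort | reducer.py` shape concrete, here is a minimal in-process sketch of the same three phases. The function names (`simulate_streaming`, `word_count_map`, `word_count_reduce`) are made up for illustration and are not part of Hadoop or of the tutorial repo; real Hadoop runs the phases across many machines, which this toy version does not attempt.

```python
from itertools import groupby
from operator import itemgetter


def simulate_streaming(lines, map_fn, reduce_fn):
    """Run map -> sort -> reduce in one process, like `cat | mapper | sort | reducer`."""
    # Map phase: every input line may emit zero or more (key, value) pairs.
    pairs = [pair for line in lines for pair in map_fn(line)]
    # Sort phase: sorting by key brings identical keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: each key is reduced together with all of its values.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def word_count_map(line):
    # Hypothetical mapper: emit (word, 1) for every whitespace-separated word.
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    # Hypothetical reducer: add up all the 1s seen for this word.
    return word, sum(counts)


if __name__ == "__main__":
    text = ["to be or not to be", "that is the question"]
    for word, count in simulate_streaming(text, word_count_map, word_count_reduce):
        print(f"{word}\t{count}")
```

The sort in the middle is what lets the reducer see all values for one key in a single consecutive run, which is the same job `sort` does in the shell pipeline.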
  8. Python wordcount mapper: http://github.com/ptarjan/hands-on-hadoop-tutorial/blob/master/mapper.py
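The mapper itself lives at the link above and is not reproduced in these slides. As a rough idea of what a streaming word-count mapper can look like, here is a sketch written for Python 3; it is an assumption about the general shape, not a copy of the repo's `mapper.py`, and `map_lines` is a name invented here. It relies on the Hadoop streaming default that the text before the first tab on each output line is the key.

```python
#!/usr/bin/env python3
import sys


def map_lines(lines):
    """Yield one 'word<TAB>1' record for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            # Key and value are separated by a tab, the streaming default.
            yield f"{word}\t1"


if __name__ == "__main__":
    # Hadoop streaming feeds input splits on stdin and collects stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

Keeping the logic in a function that takes any iterable of lines makes it easy to try out without a cluster, for example `list(map_lines(["to be"]))`.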
  9. Python wordcount reducer: http://github.com/ptarjan/hands-on-hadoop-tutorial/blob/master/reducer.py
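The reducer is likewise only linked, not shown. Below is a hedged Python 3 sketch of a streaming word-count reducer; it is not the repo's `reducer.py`, and `reduce_lines` is a hypothetical name. It assumes every input line is a well-formed `word<TAB>count` record and that the input arrives sorted by word, which is what the `sort` step (or Hadoop's own sort between map and reduce) provides.

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def reduce_lines(sorted_lines):
    """Sum the counts of consecutive records that share the same word."""
    # Split each non-blank line into [word, count] at the first tab.
    records = (
        line.rstrip("\n").split("\t", 1) for line in sorted_lines if line.strip()
    )
    # groupby only merges neighbours, so the input must already be sorted by word.
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    for record in reduce_lines(sys.stdin):
        print(record)
```

Because it only ever compares neighbouring records, this kind of reducer needs to hold just one word's running total at a time, however large the input is.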
  10. Run
      - $ cat data | mapper.py | sort | reducer.py
      - If that works, run it live:
        - Put the data into Hadoop's Distributed File System (DFS)
        - Run Hadoop
        - Read the output data in the DFS
  11. Run Hadoop
      - Stream data through these two files, saving the output back to HDFS:
      - $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input input_dir -output output_dir -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
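If you would rather launch the job from a Python script than type the command, the same invocation can be assembled as an argument list. This is only a sketch: `streaming_command` is a helper invented here, the `/opt/hadoop` fallback is a placeholder, and the flags are simply the ones from the command above.

```python
import os


def streaming_command(hadoop_home, input_path, output_path, mapper, reducer):
    """Build the argument list for the streaming job shown on this slide."""
    return [
        os.path.join(hadoop_home, "bin", "hadoop"),
        "jar",
        os.path.join(hadoop_home, "hadoop-streaming.jar"),
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
        # Ship both scripts along with the job so the cluster nodes can run them.
        "-file", mapper,
        "-file", reducer,
    ]


if __name__ == "__main__":
    # The fallback path is a placeholder; point HADOOP_HOME at your own install.
    home = os.environ.get("HADOOP_HOME", "/opt/hadoop")
    command = streaming_command(home, "input_dir", "output_dir", "mapper.py", "reducer.py")
    print(" ".join(command))
    # To actually launch the job: subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems if a path ever contains spaces.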
  12. View output
      - View output files:
        - $ hadoop dfs -ls output_dir
      - Note multiple output files ("part-00000", "part-00001", etc.)
      - View output file contents:
        - $ hadoop dfs -cat output_dir/part-00000
  13. Do it live!
      - http://github.com/ptarjan/hands-on-hadoop-tutorial/
  14. Live Commands
      - # make a new directory
      - mkdir count_example
      - cd count_example
      - # get the files (could be done with a git clone, but wget is more prevalent than git for now)
      - wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/mapper.py
      - wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/reducer.py
      - wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/reducer_numsort.py
      - wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/hamlet.txt
      - chmod u+x *.py
      - # run it locally
      - cat hamlet.txt | ./mapper.py | head
      - cat hamlet.txt | ./mapper.py | sort | ./reducer.py | head
      - cat hamlet.txt | ./mapper.py | sort | ./reducer.py | sort -k 2 -r -n | head
      - cat hamlet.txt | ./mapper.py | sort | ./reducer_numsort.py | head
      - # put the data in
      - hadoop fs -mkdir count_example
      - hadoop fs -put hamlet.txt count_example
      - # yahoo search specific - CHANGE TO YOUR OWN QUEUE
      - PARAMS=-Dmapred.job.queue.name=search_fast_lane
      - # run it
      - hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS -mapper mapper.py -reducer reducer.py -input count_example/hamlet.txt -output count_example/hamlet_out -file mapper.py -file reducer.py
      - # view the output
      - hadoop fs -ls count_example/hamlet_out/
      - hadoop fs -cat count_example/hamlet_out/* | head
      - # run the num sorted
      - hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS -mapper mapper.py -reducer reducer_numsort.py -input count_example/hamlet.txt -output count_example/hamlet_numsort_out -file mapper.py -file reducer_numsort.py
      - # view the output
      - mkdir out
      - hadoop fs -cat count_example/hamlet_numsort_out/* | sort -nrk 2 > out/hamlet_numsort.txt
      - head out/hamlet_numsort.txt
      - # test that hadoop worked
      - # apply the same sort to make sure ties are broken the same way
      - cat hamlet.txt | ./mapper.py | sort | ./reducer_numsort.py | sort -nrk 2 > out/hamlet_numsort_local.txt
      - diff out/hamlet_numsort.txt out/hamlet_numsort_local.txt
      - # EXTRA (wikipedia)
      - wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz
      - gunzip -c enwiki-latest-all-titles-in-ns0.gz | hadoop fs -put - count_example/wiki_titles
      - hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS -mapper mapper.py -reducer reducer_numsort.py -input count_example/wiki_titles -output count_example/wiki_titles_out -file mapper.py -file reducer_numsort.py
      - hadoop fs -cat count_example/wiki_titles_out/* | head
      - # cleanup
      - hadoop fs -rmr count_example
      - cd ..
      - rm -r count_example
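The `sort -k 2 -r -n | head` step in the live commands picks out the most frequent words from the reducer output. The same idea can be sketched in Python; `top_words` is a name invented here, it assumes tab-separated `word<TAB>count` lines, and its tie-breaking (alphabetical by word) is a choice made for this sketch that may differ from what `sort` does.

```python
import sys


def top_words(lines, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The count is whatever follows the last tab on the line.
        word, count = line.rsplit("\t", 1)
        pairs.append((word, int(count)))
    # Highest count first; ties are ordered by word so the result is deterministic.
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return pairs[:n]


if __name__ == "__main__":
    # Example: hadoop fs -cat count_example/hamlet_out/* | python3 top_words.py
    for word, count in top_words(sys.stdin):
        print(f"{word}\t{count}")
```

This reads the whole output into memory, which is fine for something the size of hamlet.txt but is not how you would want to rank a large crawl.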