Hands on Hadoop

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    3 Favorites

    Hands on Hadoop - Presentation Transcript

    1. Paul Tarjan Chief Technical Monkey Yahoo! http://paulisageek.com
    2. Data is ugly
      • 1 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0 29 08bade48-1081-488f-b459-6c75d75312ae 3 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
      • 29 08bade48-1081-488f-b459-6c75d75312ae 2 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0 11 08bade48-1081-488f-b459-6c75d75312ae 1 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 2.0 0.0
    3. Summaries are OK
    4. Graphs are pretty
    5. Word count, yay!
      • Given a document, how many of each word are there?
      • Real world :
        • Given our search logs, how many people click on result 1
        • Given our flickr photos, how many cat photos are there
        • Give our web crawl, what are the 10 most popular words?
    6. Map/Reduce
    7. Hadoop streaming
      • Basically, write some code that runs like :
      • $ cat data | mapper.py | sort | reducer.py
    8. Python wordcount mapper http://github.com/ptarjan/hands-on-hadoop-tutorial/blob/master/mapper.py
    9. Python wordcount reducer http://github.com/ptarjan/hands-on-hadoop-tutorial/blob/master/reducer.py
    10. Run
      • $ cat data | mapper.py | sort | reducer.py
      • If that works, run it live
        • Put the data into hadoops Distributed File System (DFS)
        • Run hadoop
        • Read the output data in the DFS
    11. Run Hadoop
      • Stream data through these two files, saving the output back to HDFS: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input input_dir -output output_dir -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
    12. View output
      • View output files:
        • $ hadoop dfs -ls output_dir
      • Note multiple output files ("part-00000", "part-00001", etc)
      • View output file contents:
        • $ hadoop dfs -cat output_dir/part-00000
    13. Do it live!
      • http://github.com/ptarjan/hands-on-hadoop-tutorial/
    14. Live Commands
      • # make a new directory
      • mkdir count_example
      • cd count_example
      • # get the files (could be done with a git clone, but wget is more prevalent than git for now)
      • wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/mapper.py
      • wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/reducer.py
      • wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/reducer_numsort.py
      • wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/hamlet.txt
      • chmod u+x *.py
      • # run it locally
      • cat hamlet.txt | ./mapper.py | head
      • cat hamlet.txt | ./mapper.py | sort | ./reducer.py | head
      • cat hamlet.txt | ./mapper.py | sort | ./reducer.py | sort -k 2 -r -n | head
      • cat hamlet.txt | ./mapper.py | sort | ./reducer_numsort.py | head
      • # put the data in
      • hadoop fs -mkdir count_example
      • hadoop fs -put hamlet.txt count_example
      • # yahoo search specific - CHANGE TO YOUR OWN QUEUE
      • PARAMS=-Dmapred.job.queue.name=search_fast_lane
      • # run it
      • hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS -mapper mapper.py -reducer reducer.py -input count_example/hamlet.txt -output count_example/hamlet_out -file mapper.py -file reducer.py
      • # view the output
      • hadoop fs -ls count_example/hamlet_out/
      • hadoop fs -cat count_example/hamlet_out/* | head
      • # run the num sorted
      • hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS -mapper mapper.py -reducer reducer_numsort.py -input count_example/hamlet.txt -output count_example/hamlet_numsort_out -file mapper.py -file reducer_numsort.py
      • # view the output
      • mkdir out
      • hadoop fs -cat count_example/hamlet_numsort_out/* | sort -nrk 2 > out/hamlet_numsort.txt
      • head out/hamlet_numsort.txt
      • # test that hadoop worked
      • # apply the same sort to make sure ties are broken the same way
      • cat hamlet.txt | ./mapper.py | sort | ./reducer_numsort.py | sort -nrk 2 > out/hamlet_numsort_local.txt
      • diff out/hamlet_numsort.txt out/hamlet_numsort_local.txt
      • # EXTRA (wikipedia)
      • wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz
      • gunzip -c enwiki-latest-all-titles-in-ns0.gz | hadoop fs -put - count_example/wiki_titles
      • hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS -mapper mapper.py -reducer reducer_numsort.py -input count_example/wiki_titles -output count_example/wiki_titles_out -file mapper.py -file reducer_numsort.py
      • hadoop fs -cat count_example/wiki_titles_out/* | head
      • # cleanup
      • hadoop fs -rmr count_example
      • cd ..
      • rm -r count_example

    + Paul TarjanPaul Tarjan, 1 month ago

    custom

    269 views, 3 favs, 0 embeds more stats

    My intro talk for hadoop and how to use it with pyt more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 269
      • 269 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 3
    • Downloads 7
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories