A brief, hands-on introduction to Hadoop & Pig

    A brief, hands-on introduction to Hadoop & Pig - Presentation Transcript

    1. A brief, hands-on introduction to Hadoop & Pig
      • Erik Eldridge, Yahoo! Developer Network
      • OSCAMP 2009
      • Photo credit: http://www.flickr.com/photos/mckaysavage/1059144105/sizes/l/
    2. Preamble
      • Intention: approach Hadoop from a tool user's perspective, specifically a web developer's
      • Intended audience: anyone with a desire to begin using Hadoop
    3. Requirements
      • VMWare
        • Hadoop will be demonstrated using a VMWare virtual machine
        • I’ve found the use of a virtual machine to be the easiest way to get started with Hadoop
    4. Set up VM
      • Get the Hadoop VM from: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
      • Note:
        • user name: hadoop-user
        • password: hadoop
      • Launch the VM
      • Log in
      • Note the IP address of the machine
    5. Start Hadoop
      • Run the util to launch hadoop: $ ~/start-hadoop
      • If it's already running, we'll get an error like "172.16.83.132: datanode running as process 6752. Stop it first. 172.16.83.132: secondarynamenode running as process 6845. Stop it first. ..."
    6. Saying “hi” to Hadoop
      • Call hadoop command line util: $ hadoop
      • Hadoop command line options are listed here: http://hadoop.apache.org/common/docs/r0.17.0/hdfs_shell.html
      • Hadoop should have been launched on boot. Verify that this is the case: $ hadoop dfs -ls /
    7. Saying “hi” to Hadoop
      • If hadoop has not been started, you'll see something like: "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s). 09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s)..."
      • If hadoop has been launched, the dfs -ls command should show the contents of hdfs
      • Before continuing, view all the hadoop utilities and sample files: $ ls
    8. Install Apache
      • Why? In the interest of creating a relevant example, I'm going to work on Apache access logs
      • Update the apt package index so apt-get can find apache2: $ sudo apt-get update
      • Install apache2 so we can generate access log data: $ sudo apt-get install apache2
    9. Generate data
      • Jump into the directory containing the apache logs: $ cd /var/log/apache2
      • Show the last 10 lines of the access log and keep following new entries: $ tail -f -n 10 access.log
    10. Generate data
      • Put this script, or something similar, in an executable file on your local machine:
        • #!/bin/bash
        • url='http://{VM IP address}:'
        • for i in {1..1000}
        • do
        • curl "$url"
        • done
      • Edit the IP address to that of your VM
    11. Generate data
      • Set executable permissions on the file: $ chmod +x generate.sh
      • Run the file: $ ./generate.sh
      • Note log data in tail output in VM
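If bash or curl is not available on the local machine, the same request loop can be sketched in Python. This is only an illustration: the function name `generate_traffic` and its `fetch` parameter are hypothetical, and the URL must be replaced with your VM's address.

```python
from urllib.request import urlopen


def generate_traffic(url, count=1000, fetch=urlopen):
    """Request `url` `count` times and return how many requests succeeded."""
    succeeded = 0
    for _ in range(count):
        try:
            # Each successful request should add one line to Apache's access.log.
            with fetch(url) as response:
                response.read()
            succeeded += 1
        except OSError:
            # urllib's URLError and socket errors are OSError subclasses;
            # failed requests are skipped rather than aborting the run.
            pass
    return succeeded
```

Calling `generate_traffic("http://{VM IP address}/")` with the placeholder replaced should produce roughly the same log entries as the shell script. The `fetch` parameter exists only so the loop can be exercised without a network.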
    12. Exploring HDFS
      • Ref: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html
      • Show home dir structure:
        • $ hadoop dfs -ls /user
        • $ hadoop dfs -ls /user/hadoop-user
      • Create a directory: $ hadoop dfs -mkdir /user/hadoop-user/foo
      • Show new dir: $ hadoop dfs -ls /user/hadoop-user/
    13. Exploring HDFS
      • Attempt to re-create new dir and note error: $ hadoop dfs -mkdir /user/hadoop-user/foo
      • Create a directory using a relative path, which HDFS resolves against the user's home directory (/user/hadoop-user): $ hadoop dfs -mkdir bar
      • Auto-create nested destination directories: $ hadoop dfs -mkdir dir1/dir2/dir3
      • Remove a dir by absolute path: $ hadoop dfs -rmr /user/hadoop-user/foo
      • Remove several dirs at once by relative path: $ hadoop dfs -rmr bar dir1
      • Try to re-remove dir and note error: $ hadoop dfs -rmr bar
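The commands above mix absolute paths (/user/hadoop-user/foo) with relative ones (bar, dir1/dir2/dir3). Relative HDFS paths are resolved against the user's home directory under /user. The sketch below only illustrates that rule in plain Python. The function name `resolve_hdfs_path` is hypothetical and nothing here talks to HDFS.

```python
import posixpath


def resolve_hdfs_path(path, user="hadoop-user"):
    """Return the absolute HDFS path that `path` refers to for `user`."""
    home = posixpath.join("/user", user)
    # posixpath.join discards `home` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(home, path))


print(resolve_hdfs_path("bar"))
print(resolve_hdfs_path("/user/hadoop-user/foo"))
```

This is why `hadoop dfs -mkdir bar` and `hadoop dfs -ls /user/hadoop-user/` refer to the same place for the hadoop-user account.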
    14. Browse HDFS using web UI
      • Open http://{VM IP address}:50070 in a browser
      • More info: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_user_guide.html#Web+Interface
    15. Import access log data
      • Load access log into hdfs: $ hadoop dfs -put /var/log/apache2/access.log input/access.log
      • Verify it's in there: $ hadoop dfs -ls input/access.log
      • View the contents: $ hadoop dfs -cat input/access.log
    16. Do something w/ the data
      • Ref: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
      • Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
      • Save the mapper and reducer code in two separate files, e.g., mapper.py and reducer.py
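The slides do not reproduce the mapper and reducer code, which comes from the tutorial credited above. As a rough idea of what a word-count pair for Hadoop Streaming looks like, here is a minimal sketch. It is not the tutorial's code, the function names `map_lines` and `reduce_lines` are hypothetical, and it assumes Python 3, which may be newer than the Python on the VM.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: sum the counts of each run of identical words.

    The input must already be sorted by word, which Hadoop Streaming
    does between the map and reduce steps.
    """
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()
    )
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() stands in for Hadoop's sort between the two steps.
sample = ["GET /index.html GET\n", "POST /index.html\n"]
for record in reduce_lines(sorted(map_lines(sample))):
    print(record)
```

In a real streaming job, each function would live in its own executable script that reads `sys.stdin` and prints each record, since Hadoop Streaming communicates with the scripts through standard input and output.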
    17. Do something w/ the data
      • Stream data through these two files, saving the output back to HDFS:
        • #!/bin/bash
        • $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.0-streaming.jar \
        •   -mapper /home/hadoop-user/wordcount/mapper.py \
        •   -reducer /home/hadoop-user/wordcount/reducer.py \
        •   -input /user/hadoop-user/input/access.log \
        •   -output /user/hadoop-user/output/mapReduceOut
    18. Do something w/ the data
      • View output files: $ hadoop dfs -ls output/mapReduceOut
      • Note the multiple output files ("part-00000", "part-00001", etc.)
      • View output file contents: $ hadoop dfs -cat output/mapReduceOut/part-00000
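Raw part files are long and hard to scan. Assuming the reducer writes one `word<TAB>count` pair per line (an assumption about the output format, not something the slides state), a short Python helper can pick out the most frequent words. The name `top_words` is hypothetical.

```python
def top_words(lines, n=10):
    """Parse 'word<TAB>count' lines and return the n most frequent pairs."""
    counts = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        counts.append((word, int(count)))
    # Sort by count descending, then by word so that ties have a fixed order.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]


sample = ["GET\t3\n", "/\t5\n", "POST\t1\n", "HEAD\t3\n"]
print(top_words(sample, n=3))
```

To rank words across the whole job, the contents of every part file would need to be fed in, for example by piping the `hadoop dfs -cat` output of each file into a script that calls `top_words(sys.stdin)`.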
    19. Pig
      • Pig is a higher-level interface for Hadoop
        • Interactive shell, Grunt
        • Declarative, SQL-like language, Pig Latin
        • Pig engine compiles Pig Latin into MapReduce
        • Extensible via user-defined functions written in Java
      • "Writing MapReduce routines is like coding in assembly"
      • Higher-level tools include Pig, Hive, etc.
    20. Exploring Pig
      • Ref: http://wiki.apache.org/pig/PigTutorial
      • Pig is already on the VM
      • Launch pig w/ connection to cluster: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
      • View contents of HDFS from grunt: > ls
    21. Perform word count w/ Pig
      • Save this script in a file, e.g., wordcount.pig:
        • myinput = LOAD 'input/access.log' USING TextLoader();
        • words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
        • grouped = GROUP words BY $0;
        • counts = FOREACH grouped GENERATE group, COUNT(words);
        • ordered = ORDER counts BY $0;
        • STORE ordered INTO 'output/pigOut' USING PigStorage();
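For readers new to Pig Latin, it may help to see the same data flow in plain Python. The sketch below mirrors each statement of the script on an in-memory list of lines. It is an approximation only: the function name `pig_wordcount` is hypothetical, and Pig's TOKENIZE uses its own set of word separators, so splitting on whitespace here may give slightly different counts.

```python
from collections import Counter


def pig_wordcount(lines):
    """Mirror the Pig script: tokenize, group by word, count, order by word."""
    # FOREACH myinput GENERATE FLATTEN(TOKENIZE($0)): one record per word.
    words = [word for line in lines for word in line.split()]
    # GROUP words BY $0, then COUNT(words): occurrences of each word.
    counts = Counter(words)
    # ORDER counts BY $0: sort by the word itself.
    ordered = sorted(counts.items())
    # STORE ... USING PigStorage(): tab-separated fields, one record per line.
    return [f"{word}\t{count}" for word, count in ordered]


for record in pig_wordcount(["b a b", "c a b"]):
    print(record)
```

The point of the comparison is that each Pig statement names one step of the flow, while Pig itself takes care of turning those steps into MapReduce jobs on the cluster.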
    22. Perform word count w/ Pig
      • Run this script: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig
      • View output: $ hadoop dfs -cat output/pigOut/part-00000
    23. Resources
      • Apache Hadoop Site
      • Apache Pig Site
      • YDN Hadoop Tutorial
        • Virtual Machine
    24. Thank you
      • Follow me on Twitter: http://twitter.com/erikeldridge
      • Find these slides on Slideshare: http://slideshare.net/erikeldridge
      • Rate this talk on SpeakerRate: http://speakerrate.com/erikeldridge

    Erik Eldridge, 4 months ago

    © All Rights Reserved
