A brief, hands-on introduction to Hadoop & Pig
Transcript

  • 1. A brief, hands-on introduction to Hadoop & Pig Erik Eldridge Yahoo! Developer Network OSCAMP 2009 Photo credit: http://www.flickr.com/photos/mckaysavage/1059144105/sizes/l/
  • 2. Preamble
    • Intention: approach hadoop from a tool-user's perspective, specifically, a web dev's perspective
    • Intended audience: anyone with a desire to begin using Hadoop
  • 3. Requirements
    • VMWare
      • Hadoop will be demonstrated using a VMWare virtual machine
      • I’ve found the use of a virtual machine to be the easiest way to get started with Hadoop
  • 4. Setup VM
    • Get hadoop vm from: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
    • Note:
      • user name: hadoop-user
      • password: hadoop
    • Launch vm
    • Log in
    • Note ip of machine
  • 5. Start Hadoop
    • Run the util to launch hadoop: $ ~/start-hadoop
    • If it's already running, we'll get an error like "172.16.83.132: datanode running as process 6752. Stop it first. 172.16.83.132: secondarynamenode running as process 6845. Stop it first. ..."
  • 6. Saying “hi” to Hadoop
    • Call hadoop command line util: $ hadoop
    • Hadoop command line options are listed here: http://hadoop.apache.org/common/docs/r0.17.0/hdfs_shell.html
    • Hadoop should have been launched on boot. Verify this is the case: $ hadoop dfs -ls /
  • 7. Saying “hi” to Hadoop
    • If hadoop has not been started, you'll see something like: "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s). 09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s)...”
    • If hadoop has been launched, the dfs -ls command should show the contents of hdfs
    • Before continuing, view all the hadoop utilities and sample files: $ ls
  • 8. Install Apache
    • Why? In the interest of creating a relevant example, I'm going to work on Apache access logs
    • Update apt-get so it can find apache2: $ sudo apt-get update
    • Install apache2 so we can generate access log data: $ sudo apt-get install apache2
  • 9. Generate data
    • Jump into the directory containing the apache logs: $ cd /var/log/apache2
    • Follow the last 10 lines of the access log (updating as new requests arrive): $ tail -f -n 10 access.log
  • 10. Generate data
    • Put this script, or something similar, in an executable file (e.g., generate.sh) on your local machine:
      • #!/bin/bash
      • url='http://{VM IP address}/'
      • for i in {1..1000}
      • do
      • curl $url
      • done
    • Edit the IP address to that of your VM
  • 11. Generate data
    • Set executable permissions on the file: $ chmod +x generate.sh
    • Run the file: $ ./generate.sh
    • Note the new log entries appearing in the tail output on the VM
  • 12. Exploring HDFS
    • Ref: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html
    • Show home dir structure:
      • $ hadoop dfs -ls /user
      • $ hadoop dfs -ls /user/hadoop-user
    • Create a directory: $ hadoop dfs -mkdir /user/hadoop-user/foo
    • Show new dir: $ hadoop dfs -ls /user/hadoop-user/
  • 13. Exploring HDFS
    • Attempt to re-create new dir and note error: $ hadoop dfs -mkdir /user/hadoop-user/foo
    • Create a directory using a relative path (resolved under the home dir): $ hadoop dfs -mkdir bar
    • Auto-create nested destination directories: $ hadoop dfs -mkdir dir1/dir2/dir3
    • Remove dir: $ hadoop dfs -rmr /user/hadoop-user/foo
    • Remove the remaining dirs: $ hadoop dfs -rmr bar dir1
    • Try to re-remove dir and note error: $ hadoop dfs -rmr bar
  • 14. Browse HDFS using web UI
    • Open http://{VM IP address}:50070 in browser
    • More info: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_user_guide.html#Web+Interface
  • 15. Import access log data
    • Load access log into hdfs: $ hadoop dfs -put /var/log/apache2/access.log input/access.log
    • Verify it's in there: $ hadoop dfs -ls input/access.log
    • View the contents: $ hadoop dfs -cat input/access.log
  • 16. Do something w/ the data
    • Ref: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
    • Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
    • Save the mapper and reducer code in two separate files, e.g., mapper.py and reducer.py (a sketch of both follows)
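    • A minimal sketch of what the two scripts might contain, following the word-count pattern in the tutorial cited above (not necessarily the author's exact code):
    • mapper.py:
      #!/usr/bin/env python
      # Emit "<word><TAB>1" for every whitespace-separated token read from stdin.
      import sys
      for line in sys.stdin:
          for word in line.strip().split():
              print('%s\t1' % word)
    • reducer.py:
      #!/usr/bin/env python
      # Sum the counts per word. Hadoop Streaming sorts mapper output by key,
      # so all occurrences of a word arrive consecutively.
      import sys
      current_word, current_count = None, 0
      for line in sys.stdin:
          word, count = line.rstrip('\n').split('\t', 1)
          if word == current_word:
              current_count += int(count)
          else:
              if current_word is not None:
                  print('%s\t%d' % (current_word, current_count))
              current_word, current_count = word, int(count)
      if current_word is not None:
          print('%s\t%d' % (current_word, current_count))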
  • 17. Do something w/ the data
    • Stream the data through these two files, saving the output back to HDFS:
      • #!/bin/bash
      • $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.0-streaming.jar \
      •   -mapper /home/hadoop-user/wordcount/mapper.py \
      •   -reducer /home/hadoop-user/wordcount/reducer.py \
      •   -input /user/hadoop-user/input/access.log \
      •   -output /user/hadoop-user/output/mapReduceOut
  • 18. Do something w/ the data
    • View output files: $ hadoop dfs -ls output/mapReduceOut
    • Note multiple output files ("part-00000", "part-00001", etc)
    • View output file contents: $ hadoop dfs -cat output/mapReduceOut/part-00000
  • 19. Pig
    • Pig is a higher-level interface for hadoop
      • Interactive shell Grunt
      • Declarative, SQL-like language, Pig Latin
      • Pig engine compiles Pig Latin into MapReduce
      • Extensible via Java files
    • "writing mapreduce routines, is like coding in assembly”
    • Pig, Hive, etc.
  • 20. Exploring Pig
    • Ref: http://wiki.apache.org/pig/PigTutorial
    • Pig is already on the VM
    • Launch pig w/ connection to cluster: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
    • View contents of HDFS from grunt: > ls
  • 21. Perform word count w/ Pig
    • Save this script in a file, e.g., wordcount.pig:
      • myinput = LOAD 'input/access.log' USING TextLoader();
      • words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
      • grouped = GROUP words BY $0;
      • counts = FOREACH grouped GENERATE group, COUNT(words);
      • ordered = ORDER counts BY $0;
      • STORE ordered INTO 'output/pigOut' USING PigStorage();
  • 22. Perform word count w/ Pig
    • Run this script: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig
    • View output: $ hadoop dfs -cat output/pigOut/part-00000
  • 23. Resources
    • Apache Hadoop Site
    • Apache Pig Site
    • YDN Hadoop Tutorial
      • Virtual Machine
  • 24. Thank you
    • Follow me on Twitter: http://twitter.com/erikeldridge
    • Find these slides on Slideshare: http://slideshare.net/erikeldridge
    • Rate this talk on SpeakerRate: http://speakerrate.com/erikeldridge