A brief, hands-on introduction to Hadoop & Pig
A brief, hands-on introduction to Hadoop & Pig Presentation Transcript

  • 1. A brief, hands-on introduction to Hadoop & Pig Erik Eldridge Yahoo! Developer Network OSCAMP 2009 Photo credit: http://www.flickr.com/photos/mckaysavage/1059144105/sizes/l/
  • 2. Preamble
    • Intention: approach hadoop from a tool-user's perspective, specifically, a web dev's perspective
    • Intended audience: anyone with a desire to begin using Hadoop
  • 3. Requirements
    • VMWare
      • Hadoop will be demonstrated using a VMWare virtual machine
      • I’ve found the use of a virtual machine to be the easiest way to get started with Hadoop
  • 4. Setup VM
    • Get hadoop vm from: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
    • Note:
      • user name: hadoop-user
      • password: hadoop
    • Launch vm
    • Log in
    • Note the IP address of the machine
  • 5. Start Hadoop
    • Run the util to launch hadoop: $ ~/start-hadoop
    • If it's already running, we'll get an error like "172.16.83.132: datanode running as process 6752. Stop it first. 172.16.83.132: secondarynamenode running as process 6845. Stop it first. ..."
  • 6. Saying “hi” to Hadoop
    • Call hadoop command line util: $ hadoop
    • Hadoop command line options are listed here: http://hadoop.apache.org/common/docs/r0.17.0/hdfs_shell.html
    • Hadoop should have been launched on boot. Verify this is the case: $ hadoop dfs -ls /
  • 7. Saying “hi” to Hadoop
    • If hadoop has not been started, you'll see something like: "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s). 09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s)...”
    • If hadoop has been launched, the dfs -ls command should show the contents of hdfs
    • Before continuing, view all the hadoop utilities and sample files: $ ls
  • 8. Install Apache
    • Why? In the interest of creating a relevant example, I'm going to work on Apache access logs
    • Update apt-get so it can find apache2: $ sudo apt-get update
    • Install apache2 so we can generate access log data: $ sudo apt-get install apache2
  • 9. Generate data
    • Jump into the directory containing the apache logs: $ cd /var/log/apache2
    • Follow the access log, starting with its last 10 lines: $ tail -f -n 10 access.log
  • 10. Generate data
    • Put this script, or something similar, in an executable file on your local machine:
      • #!/bin/bash
      • url='http://{VM IP address}:'
      • for i in {1..1000}
      • do
      • curl $url
      • done
    • Edit the IP address to that of your VM
  • 11. Generate data
    • Set executable permissions on the file: $ chmod +x generate.sh
    • Run the file: $ ./generate.sh
    • Note the new log entries appearing in the tail output in the VM
  • 12. Exploring HDFS
    • Ref: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html
    • Show home dir structure:
      • $ hadoop dfs -ls /user
      • $ hadoop dfs -ls /user/hadoop-user
    • Create a directory: $ hadoop dfs -mkdir /user/hadoop-user/foo
    • Show new dir: $ hadoop dfs -ls /user/hadoop-user/
  • 13. Exploring HDFS
    • Attempt to re-create new dir and note error: $ hadoop dfs -mkdir /user/hadoop-user/foo
    • Create a directory using an implicit (relative) path: $ hadoop dfs -mkdir bar
    • Auto-create nested destination directories: $ hadoop dfs -mkdir dir1/dir2/dir3
    • Remove dir: $ hadoop dfs -rmr /user/hadoop-user/foo
    • Remove multiple dirs at once: $ hadoop dfs -rmr bar dir1
    • Try to re-remove dir and note error: $ hadoop dfs -rmr bar
  • 14. Browse HDFS using web UI
    • Open http://{VM IP address}:50070 in browser
    • More info: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_user_guide.html#Web+Interface
  • 15. Import access log data
    • Load access log into hdfs: $ hadoop dfs -put /var/log/apache2/access.log input/access.log
    • Verify it's in there: $ hadoop dfs -ls input/access.log
    • View the contents: $ hadoop dfs -cat input/access.log
  • 16. Do something w/ the data
    • Ref: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
    • Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
    • Save the mapper and reducer code in two separate files, e.g., mapper.py and reducer.py (a sketch of both follows below)
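    • The deck doesn't include the mapper and reducer source; a minimal word-count pair along the lines of the tutorial linked above could look like the following (an assumed sketch, not code from the original slides):
      • mapper.py:
        #!/usr/bin/env python
        # Assumed sketch: read lines from stdin, emit one "word<TAB>1" pair per token
        import sys
        for line in sys.stdin:
            for word in line.strip().split():
                sys.stdout.write('%s\t1\n' % word)
      • reducer.py:
        #!/usr/bin/env python
        # Assumed sketch: sum the counts for each word; Hadoop streaming sorts the
        # mapper output by key, so identical words arrive on consecutive lines
        import sys
        current_word, current_count = None, 0
        for line in sys.stdin:
            word, _, count = line.strip().partition('\t')
            if not count.isdigit():
                continue  # skip malformed lines
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    sys.stdout.write('%s\t%d\n' % (current_word, current_count))
                current_word, current_count = word, int(count)
        if current_word is not None:
            sys.stdout.write('%s\t%d\n' % (current_word, current_count))
      • Make both files executable so streaming can invoke them: $ chmod +x mapper.py reducer.py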
  • 17. Do something w/ the data
    • Stream data through these two files, saving the output back to HDFS:
      • #!/bin/bash
      • $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.0-streaming.jar \
      • -mapper /home/hadoop-user/wordcount/mapper.py \
      • -reducer /home/hadoop-user/wordcount/reducer.py \
      • -input /user/hadoop-user/input/access.log \
      • -output /user/hadoop-user/output/mapReduceOut
  • 18. Do something w/ the data
    • View output files: $ hadoop dfs -ls output/mapReduceOut
    • Note multiple output files ("part-00000", "part-00001", etc)
    • View output file contents: $ hadoop dfs -cat output/mapReduceOut/part-00000
  • 19. Pig
    • Pig is a higher-level interface for hadoop
      • An interactive shell, Grunt
      • Declarative, SQL-like language, Pig Latin
      • Pig engine compiles Pig Latin into MapReduce
      • Extensible via Java files
    • "writing mapreduce routines, is like coding in assembly”
    • Pig, Hive, etc.
  • 20. Exploring Pig
    • Ref: http://wiki.apache.org/pig/PigTutorial
    • Pig is already on the VM
    • Launch pig w/ connection to cluster: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
    • View the contents of HDFS from the Grunt shell: grunt> ls
  • 21. Perform word count w/ Pig
    • Save this script in a file, e.g., wordcount.pig:
      • myinput = LOAD 'input/access.log' USING TextLoader();
      • words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
      • grouped = GROUP words BY $0;
      • counts = FOREACH grouped GENERATE group, COUNT(words);
      • ordered = ORDER counts BY $0;
      • STORE ordered INTO 'output/pigOut' USING PigStorage();
  • 22. Perform word count w/ Pig
    • Run this script: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig
    • View the output: $ hadoop dfs -cat output/pigOut/part-00000
  • 23. Resources
    • Apache Hadoop Site
    • Apache Pig Site
    • YDN Hadoop Tutorial
      • Virtual Machine
  • 24. Thank you
    • Follow me on Twitter: http://twitter.com/erikeldridge
    • Find these slides on Slideshare: http://slideshare.net/erikeldridge
    • Rate this talk on SpeakerRate: http://speakerrate.com/erikeldridge