1. A brief, hands-on introduction to Hadoop & Pig
   Erik Eldridge, Yahoo! Developer Network, OSCAMP 2009
   Photo credit: http://www.flickr.com/photos/mckaysavage/1059144105/sizes/l/
2. Preamble
   - Intention: approach Hadoop from a tool-user's perspective, specifically a web dev's perspective
   - Intended audience: anyone with a desire to begin using Hadoop
3. Requirements
   - VMWare
     - Hadoop will be demonstrated using a VMWare virtual machine
     - I've found a virtual machine to be the easiest way to get started with Hadoop
4. Setup VM
   - Get the Hadoop VM from: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
   - Note:
     - user name: hadoop-user
     - password: hadoop
   - Launch the VM
   - Log in
   - Note the IP of the machine
5. Start Hadoop
   - Run the util to launch Hadoop: $ ~/start-hadoop
   - If it's already running, we'll get an error like: "172.16.83.132: datanode running as process 6752. Stop it first. 172.16.83.132: secondarynamenode running as process 6845. Stop it first. ..."
6. Saying "hi" to Hadoop
   - Call the hadoop command-line util: $ hadoop
   - Hadoop command-line options are listed here: http://hadoop.apache.org/common/docs/r0.17.0/hdfs_shell.html
   - Hadoop should have been launched on boot. Verify this is the case: $ hadoop dfs -ls /
7. Saying "hi" to Hadoop
   - If Hadoop has not been started, you'll see something like: "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s). 09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s)..."
   - If Hadoop has been launched, the dfs -ls command should show the contents of HDFS
   - Before continuing, view all the Hadoop utilities and sample files: $ ls
8. Install Apache
   - Why? In the interest of creating a relevant example, I'm going to work on Apache access logs
   - Update apt-get so it can find apache2: $ sudo apt-get update
   - Install apache2 so we can generate access log data: $ sudo apt-get install apache2
9. Generate data
   - Jump into the directory containing the Apache logs: $ cd /var/log/apache2
   - Follow the last 10 lines of the access log: $ tail -f -n 10 access.log
10. Generate data
    - Put this script, or something similar, in an executable file on your local machine:
        #!/bin/bash
        url='http://{VM IP address}/'
        for i in {1..1000}
        do
          curl $url
        done
    - Edit the IP address to that of your VM
11. Generate data
    - Set executable permissions on the file: $ chmod +x generate.sh
    - Run the file: $ ./generate.sh
    - Note the log data appearing in the tail output in the VM
12. Exploring HDFS
    - Ref: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html
    - Show the home dir structure:
        $ hadoop dfs -ls /user
        $ hadoop dfs -ls /user/hadoop-user
    - Create a directory: $ hadoop dfs -mkdir /user/hadoop-user/foo
    - Show the new dir: $ hadoop dfs -ls /user/hadoop-user/
13. Exploring HDFS
    - Attempt to re-create the new dir and note the error: $ hadoop dfs -mkdir /user/hadoop-user/foo
    - Create a destination directory using an implicit path: $ hadoop dfs -mkdir bar
    - Auto-create nested destination directories: $ hadoop dfs -mkdir dir1/dir2/dir3
    - Remove a dir: $ hadoop dfs -rmr /user/hadoop-user/foo
    - Remove several dirs at once: $ hadoop dfs -rmr bar dir1
    - Try to re-remove a dir and note the error: $ hadoop dfs -rmr bar
14. Browse HDFS using the web UI
    - Open http://{VM IP address}:50070 in a browser
    - More info: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_user_guide.html#Web+Interface
15. Import access log data
    - Load the access log into HDFS: $ hadoop dfs -put /var/log/apache2/access.log input/access.log
    - Verify it's in there: $ hadoop dfs -ls input/access.log
    - View the contents: $ hadoop dfs -cat input/access.log
16. Do something w/ the data
    - Ref: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
    - Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
    - Save the mapper and reducer code in two separate files, e.g., mapper.py and reducer.py (a sketch of both follows below)
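   The scripts themselves aren't reproduced on the slide; Noll's tutorial above has the canonical versions. As a reference, here is a minimal sketch of the pair, assuming plain whitespace tokenization and Hadoop Streaming's default tab-separated key/value records:

        #!/usr/bin/env python
        # mapper.py: emit "word<TAB>1" for every whitespace-separated
        # token read from stdin.
        import sys

        for line in sys.stdin:
            for word in line.split():
                sys.stdout.write('%s\t1\n' % word)

        #!/usr/bin/env python
        # reducer.py: Hadoop Streaming sorts the mapper output by key,
        # so identical words arrive on consecutive lines and a running
        # total per word is enough.
        import sys

        current_word = None
        current_count = 0

        for line in sys.stdin:
            word, count = line.rstrip('\n').split('\t', 1)
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    sys.stdout.write('%s\t%d\n' % (current_word, current_count))
                current_word = word
                current_count = int(count)

        # Emit the final word's total.
        if current_word is not None:
            sys.stdout.write('%s\t%d\n' % (current_word, current_count))

   Make both files executable ($ chmod +x mapper.py reducer.py). Since streaming is plain stdin/stdout, the pair can be smoke-tested locally before submitting the job: $ cat /var/log/apache2/access.log | ./mapper.py | sort | ./reducer.py | head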
17. Do something w/ the data
    - Stream data through these two files, saving the output back to HDFS:
        #!/bin/bash
        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.0-streaming.jar \
          -mapper /home/hadoop-user/wordcount/mapper.py \
          -reducer /home/hadoop-user/wordcount/reducer.py \
          -input /user/hadoop-user/input/access.log \
          -output /user/hadoop-user/output/mapReduceOut
18. Do something w/ the data
    - View the output files: $ hadoop dfs -ls output/mapReduceOut
    - Note the multiple output files ("part-00000", "part-00001", etc.; see the note below)
    - View an output file's contents: $ hadoop dfs -cat output/mapReduceOut/part-00000
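   There is one part file per reducer. If a single local copy is wanted instead, HDFS can concatenate them on the way out; a sketch, assuming a writable local path:

        $ hadoop dfs -getmerge output/mapReduceOut /tmp/wordcounts.txt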
19. Pig
    - Pig is a higher-level interface for Hadoop
      - Interactive shell, Grunt
      - Declarative, SQL-like language, Pig Latin
      - The Pig engine compiles Pig Latin into MapReduce jobs
      - Extensible via Java files
    - "Writing MapReduce routines is like coding in assembly"
    - Pig, Hive, etc.
20. Exploring Pig
    - Ref: http://wiki.apache.org/pig/PigTutorial
    - Pig is already on the VM
    - Launch Pig w/ a connection to the cluster: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
    - View the contents of HDFS from Grunt: grunt> ls (a short sample session follows)
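   To see the declarative style in action, statements can be typed straight into Grunt. A hypothetical session (the relation name "raw" is just an example):

        grunt> raw = LOAD 'input/access.log' USING TextLoader();
        grunt> DUMP raw;

   The LOAD only defines the relation; nothing executes until DUMP asks for output, at which point Pig compiles the plan into a MapReduce job and prints each record.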
21. Perform word count w/ Pig
    - Save this script in a file, e.g., wordcount.pig:
        -- load each log line as a single-field tuple
        myinput = LOAD 'input/access.log' USING TextLoader();
        -- split each line into words, one word per record
        words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
        -- group identical words and count each group
        grouped = GROUP words BY $0;
        counts = FOREACH grouped GENERATE group, COUNT(words);
        -- sort and write the results back to HDFS
        ordered = ORDER counts BY $0;
        STORE ordered INTO 'output/pigOut' USING PigStorage();
22. Perform word count w/ Pig
    - Run this script: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig
    - View the output: $ hadoop dfs -cat output/pigOut/part-00000
23. Resources
    - Apache Hadoop site
    - Apache Pig site
    - YDN Hadoop tutorial
      - Virtual machine
24. Thank you
    - Follow me on Twitter: http://twitter.com/erikeldridge
    - Find these slides on Slideshare: http://slideshare.net/erikeldridge
    - Rate this talk on SpeakerRate: http://speakerrate.com/erikeldridge