Hands-on Hadoop: An intro for Web developers

Hadoop is a powerful tool for performing computation on large amounts of data across multiple computers, and getting started with it is easier than you might expect. The simplest introduction uses a virtual machine (VM) available for free from the Yahoo! Developer Network. In "Hands-on Hadoop: An intro for Web developers", we explore Hadoop using this VM. Data for the examples comes from Apache log files, generated by a simple process described in the presentation. Server log analysis is a natural use case for Hadoop, and one that should convey its utility to most Web developers. The Yahoo! VM is available here: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup

1. Hands-on Hadoop: An intro for Web developers
   Erik Eldridge, Engineer/Evangelist, Yahoo! Developer Network
   Photo credit: http://www.flickr.com/photos/exfordy/429414926/sizes/l/
2. Goals
   - Gain familiarity with Hadoop
   - Approach Hadoop from a web dev's perspective
3. Prerequisites
   - VMWare
     - Hadoop will be demonstrated using a VMWare virtual machine
     - I've found a virtual machine to be the easiest way to get started with Hadoop
   - curl installed
4. Set up the VM
   - Download the VM from YDN: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
   - Note the credentials:
     - user name: hadoop-user
     - password: hadoop
   - Launch the VM
   - Log in
   - Note the IP address of the machine
5. Start Hadoop
   - Run the utility that launches Hadoop: $ ~/start-hadoop
   - If it's already running, we'll get an error like: "172.16.83.132: datanode running as process 6752. Stop it first. 172.16.83.132: secondarynamenode running as process 6845. Stop it first. ..."
6. Saying "hi" to Hadoop
   - Call the hadoop command-line utility: $ hadoop
   - Hadoop should have been launched on boot. Verify this is the case: $ hadoop dfs -ls /
7. Saying "hi" to Hadoop
   - If Hadoop has not been started, you'll see something like: "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s). 09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s)..."
   - If Hadoop has been launched, the dfs -ls command should show the contents of HDFS
   - Before continuing, view all the Hadoop utilities and sample files: $ ls
8. Install Apache
   - Why? In the interest of creating a relevant example, I'm going to work on Apache access logs
   - Update apt-get so it can find apache2: $ sudo apt-get update
   - Install apache2 so we can generate access log data: $ sudo apt-get install apache2
9. Generate data
   - Jump into the directory containing the Apache logs: $ cd /var/log/apache2
   - Follow the last 10 lines of the access log: $ tail -f -n 10 access.log
10. Generate data
   - Put this script, or something similar, in an executable file on your local machine (a sketch follows below)
   - Edit the IP address to that of your VM
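The script itself isn't reproduced here, so the following is a minimal sketch of what the generator could look like, assuming the file name generate.sh used on the next slide; the IP address is a placeholder for your VM's:

    #!/bin/bash
    # generate.sh -- repeatedly hit the VM's Apache server to produce
    # entries in /var/log/apache2/access.log. Replace VM_IP with the
    # IP address noted during VM setup.
    VM_IP="172.16.83.132"   # placeholder; edit to your VM's address

    # Request a mix of existing and missing paths so the log records
    # both 200s and 404s, and pause briefly so the tail -f output in
    # the VM stays readable.
    for i in $(seq 1 100); do
      curl -s "http://${VM_IP}/" > /dev/null
      curl -s "http://${VM_IP}/page-${i}" > /dev/null
      sleep 0.1
    done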
11. Generate data
   - Set executable permissions on the file: $ chmod +x generate.sh
   - Run the file: $ ./generate.sh
   - Note the log data appearing in the tail output in the VM
12. Exploring HDFS
   - Show the home directory structure: $ hadoop dfs -ls /user
   - Create a directory: $ hadoop dfs -mkdir /user/foo
13. Exploring HDFS
   - Attempt to re-create the new dir and note the error: $ hadoop dfs -mkdir /user/foo
   - Create a directory using an implicit path, relative to your HDFS home dir: $ hadoop dfs -mkdir bar
   - Auto-create nested directories: $ hadoop dfs -mkdir dir1/dir2/dir3
   - Remove a dir: $ hadoop dfs -rmr /user/foo
   - Remove multiple dirs at once: $ hadoop dfs -rmr bar dir1
   - Try to re-remove a dir and note the error: $ hadoop dfs -rmr bar
14. Browse HDFS using the web UI
   - Open http://{VM IP address}:50070, the NameNode's HDFS browser, in a browser (port 50030 serves the JobTracker's job-status UI)
15. Import access log data
   - Load the access log into HDFS: $ hadoop dfs -put /var/log/apache2/access.log input/access.log
   - Verify it's in there: $ hadoop dfs -ls input/access.log
   - View the contents: $ hadoop dfs -cat input/access.log
16. Count words in the data using Hadoop Streaming
   - Hadoop Streaming refers to the ability to use an arbitrary language to define a job's map and reduce processes
17. Python wordcount mapper
   Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python#Map:_mapper.py
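A minimal sketch of such a mapper, in the style of the credited tutorial (the exact code shown on the slide may differ): it reads lines from stdin and emits one tab-separated "word 1" pair per word.

    #!/usr/bin/env python
    # mapper.py -- emit "<word>\t1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%s" % (word, 1))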
18. Python wordcount reducer
   Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python#Reduce:_reducer.py
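And a minimal sketch of the matching reducer: because Hadoop Streaming sorts the mapper output by key before the reduce phase, the reducer can sum counts for the current word and emit a total whenever the word changes.

    #!/usr/bin/env python
    # reducer.py -- sum per-word counts from sorted "<word>\t<count>" input
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.strip().split('\t', 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            # input is sorted, so a new word means the previous
            # word's total is complete
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word = word
            current_count = count

    # flush the final word
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))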
19. Test-run the mapper and reducer
   - $ cat data | ./mapper.py | sort | ./reducer.py
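For example, assuming the sketches above and chmod +x mapper.py reducer.py, a quick sanity check produces tab-separated counts:

    $ echo "foo bar foo" | ./mapper.py | sort | ./reducer.py
    bar     1
    foo     2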
20. Run Hadoop
   - Stream the data through these two files, saving the output back to HDFS: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input input/access.log -output output/mapReduceOut -mapper /home/{username}/mapper.py -reducer /home/{username}/reducer.py
21. View output
   - View the output files: $ hadoop dfs -ls output/mapReduceOut
   - Note the multiple output files ("part-00000", "part-00001", etc.)
   - View an output file's contents: $ hadoop dfs -cat output/mapReduceOut/part-00000
22. Pig
   - Pig is a higher-level interface for Hadoop
     - Interactive shell, Grunt
     - Declarative, SQL-like language, Pig Latin
     - The Pig engine compiles Pig Latin into MapReduce jobs
     - Extensible via Java files
   - "Writing MapReduce routines is like coding in assembly"
   - Pig, Hive, etc.
23. Exploring Pig
   - Pig is already on the VM
   - Launch Pig with a connection to the cluster: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
   - View the contents of HDFS from Grunt: > ls
24. Pig word count
   - Save this script in a file, e.g., wordcount.pig:
     myinput = LOAD 'input/access.log' USING TextLoader();
     words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
     grouped = GROUP words BY $0;
     counts = FOREACH grouped GENERATE group, COUNT(words);
     ordered = ORDER counts BY $0;
     STORE ordered INTO 'output/pigOut' USING PigStorage();
25. Perform word count with Pig
   - Run the script: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig
   - View the output: $ hadoop dfs -cat output/pigOut/part-00000
26. Resources
   - Apache Hadoop site: hadoop.apache.org
   - Apache Pig site: hadoop.apache.org/pig/
   - YDN Hadoop tutorial: developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
   - Michael G. Noll's tutorial: www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
27. Thank you
   - Follow me on Twitter: http://twitter.com/erikeldridge
   - Find these slides on SlideShare: http://slideshare.net/erikeldridge
   - Feedback? Suggestions? http://speakerrate.com/erikeldridge
