Abstract from Paul Tarjan’s (http://paulisageek.com) talk at the University of Waterloo: ’Hadoop: Map and Reduce in the real world. At Yahoo, we have TONS of data to crawl through: log files, Flickr photos, Delicious bookmarks, Yahoo Answers, advertising bids, etc. Doing that on one machine would take forever, and writing a distributed program would mean doing a TON of housekeeping (like shuttling files, splitting a job, redoing a job when a machine dies, etc.). Enter Hadoop. It does all the crappy work of the distributed system for you. What you’ll learn: how to set up a Hadoop image on your machine to learn, how to program in Map-Reduce, and some things I use Hadoop for at Yahoo!. Materials: http://www.slideshare.net/erikeldridge/hands-on-hadoop-intro-for-web-developers-2094304 , http://blog.paulisageek.com/2009/09/hadoop-hacking-on-yahoo-ad-data.html ’
Note: slide 20 references an incorrect data set. The file /data/ydata/ydata-ysm-keyphrase-bid-imp-click-v1_0 was only available to users of the CMU Hack U temporary cluster. The input file should be input/access.log, which is loaded into HDFS on slide 15.
Hands-on Hadoop: An intro for Web developers - Presentation Transcript
Hands-on Hadoop: An intro for Web developers Erik Eldridge Engineer/Evangelist Yahoo! Developer Network Photo credit: http://www.flickr.com/photos/exfordy/429414926/sizes/l/
Goals
Gain familiarity with Hadoop
Approach Hadoop from a web dev's perspective
Prerequisites
VMware
Hadoop will be demonstrated using a VMware virtual machine
I’ve found the use of a virtual machine to be the easiest way to get started with Hadoop
If it's already running, we'll get an error like "172.16.83.132: datanode running as process 6752. Stop it first. 172.16.83.132: secondarynamenode running as process 6845. Stop it first. ..."
Saying “hi” to Hadoop
Call the hadoop command-line utility: $ hadoop
Hadoop should have been launched on boot. Verify that this is the case: $ hadoop dfs -ls /
If Hadoop has not been started, you'll see something like: "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s). 09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s)..."
If Hadoop has been launched, the dfs -ls command should show the contents of HDFS
Before continuing, view all the hadoop utilities and sample files: $ ls
Install Apache
Why? In the interest of creating a relevant example, I'm going to work on Apache access logs
Update the apt package index so apt-get can find apache2: $ sudo apt-get update
Install apache2 so we can generate access log data: $ sudo apt-get install apache2
Generate data
Jump into the directory containing the apache logs: $ cd /var/log/apache2
Show the last 10 lines of the access log and keep following new entries: $ tail -f -n 10 access.log
Put this script, or something similar, in an executable file on your local machine
Edit the IP address to that of your VM
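The script shown on the slide is not included in this transcript. As a rough stand-in, here is a minimal Python sketch that requests a handful of pages from the Apache server in the VM so the access log fills up. The path list, the function names, and the use of Python instead of a shell script are all assumptions; the IP address is the example VM address that appears in the messages quoted above and must be replaced with your own.

```python
"""Hypothetical stand-in for the slide's generate script (not the original)."""
import itertools
import urllib.error
import urllib.request

# Assumption: the example VM address from the messages above; use your VM's IP.
VM_IP = "172.16.83.132"

# Assumption: arbitrary paths. Missing pages still produce log entries (404s).
PATHS = ["/", "/index.html", "/about", "/missing-page"]


def build_urls(host, paths, count):
    """Return `count` URLs on `host`, cycling through `paths` in order."""
    cycled = itertools.islice(itertools.cycle(paths), count)
    return [f"http://{host}{path}" for path in cycled]


def send_requests(urls, timeout=5):
    """Request each URL; return HTTP status codes (None if the host is unreachable)."""
    statuses = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                statuses.append(response.status)
        except urllib.error.HTTPError as err:
            statuses.append(err.code)  # e.g. 404; Apache still logs the request
        except (urllib.error.URLError, OSError):
            statuses.append(None)  # connection refused, timeout, bad address
    return statuses
```

Calling send_requests(build_urls(VM_IP, PATHS, 100)) from your local machine would issue 100 requests, each of which should show up as one line in the VM's access.log.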
Set executable permissions on the file: $ chmod +x generate.sh
Run the file: $ ./generate.sh
Note the new log entries appearing in the tail output on the VM
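To get a feel for what each log entry contains, the following Python sketch splits one access log line into named fields. It assumes the log uses Apache's common or combined log format (host, identity, user, timestamp, request line, status, size); the regular expression, the function name, and the sample line are illustrative and not taken from the slides.

```python
import re

# Assumption: leading fields of Apache's common/combined log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no response body was sent; treat that as 0 bytes.
    fields["size"] = int(fields["size"]) if fields["size"].isdigit() else 0
    return fields


# Made-up sample line, for illustration only.
SAMPLE = ('172.16.83.1 - - [22/Jul/2009:10:05:12 -0700] '
          '"GET /index.html HTTP/1.1" 200 45 "-" "curl/7.18.2"')
```

For the sample line, parse_line(SAMPLE)["host"] is the client address and parse_line(SAMPLE)["request"] is the request line, which are the fields a log-analysis job would typically group by.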
Exploring HDFS
Show home dir structure:
$ hadoop dfs -ls /user
Create a directory:
$ hadoop dfs -mkdir /user/foo
Attempt to re-create the same directory and note the error:
$ hadoop dfs -mkdir /user/foo
Create a destination directory using implicit path:
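Stream data through the two files (mapper.py and reducer.py), saving the output back to HDFS: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input /data/ydata/ydata-ysm-keyphrase-bid-imp-click-v1_0 -output /user/{username}/output -mapper /home/{username}/mapper.py -reducer /home/{username}/reducer.py
As the note above explains, this input path was only available on the CMU Hack U temporary cluster; elsewhere, use input/access.log as the -input value.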
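The contents of mapper.py and reducer.py are not part of this transcript. As a hedged illustration of what such a pair might look like for the access log, the sketch below counts requests per client IP. Hadoop Streaming feeds lines to the mapper on stdin, sorts the mapper's tab-separated key/value output by key, and feeds the sorted lines to the reducer. The function names and the choice of "requests per IP" are assumptions. Both steps are shown in one file so they can be tried locally; on the slide they are two separate executable scripts.

```python
"""Hypothetical mapper/reducer logic for Hadoop Streaming (not the slides' code)."""
import sys
from itertools import groupby


def map_lines(lines):
    """Map step: emit 'ip<TAB>1' for every non-empty access log line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"  # first field of an access log line is the client


def reduce_lines(lines):
    """Reduce step: sum the counts per key. Input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Assumption: select the step with a command-line argument, "map" or "reduce".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for output_line in step(sys.stdin):
        print(output_line)
```

Outside Hadoop, the same flow can be approximated by sorting the mapper output before reducing, for example list(reduce_lines(sorted(map_lines(open("access.log"))))), which is a convenient way to check the logic before submitting a streaming job.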
Hadoop is a powerful tool for performing computation on a large amount of data using multiple computers. Getting started with Hadoop, however, is very easy. The simplest introduction uses a virtual machine (VM) available for free from the Yahoo! Developer Network. In "Hands-on Hadoop: An intro for Web developers", we explore Hadoop using this VM. Data for the examples comes from Apache log files, generated by a simple process described in the presentation. Server log analysis is a very natural use-case for Hadoop, and, it is hoped, should convey the utility of Hadoop to a majority of Web developers. The Yahoo! VM is available here: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup