Hands-on Hadoop: An intro for Web developers
Hands-on Hadoop: An intro for Web developers

Hadoop is a powerful tool for performing computation on large amounts of data using multiple computers. Getting started with Hadoop, however, is easy: the simplest introduction uses a virtual machine (VM) available for free from the Yahoo! Developer Network. In "Hands-on Hadoop: An intro for Web developers", we explore Hadoop using this VM. Data for the examples comes from Apache log files, generated by a simple process described in the presentation. Server log analysis is a natural use case for Hadoop and should convey its utility to most Web developers. The Yahoo! VM is available here: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup

  • Note: slide 20 of the original deck referenced an incorrect data set. The file /data/ydata/ydata-ysm-keyphrase-bid-imp-click-v1_0 was only available to users of the CMU Hack U temporary cluster. The correct input file is input/access.log, loaded into HDFS in slide 15.


  • 1. Hands-on Hadoop: An intro for Web developers Erik Eldridge Engineer/Evangelist Yahoo! Developer Network Photo credit: http://www.flickr.com/photos/exfordy/429414926/sizes/l/
  • 2. Goals
    • Gain familiarity with Hadoop
    • Approach Hadoop from a web dev's perspective
  • 3. Prerequisites
    • VMWare
      • Hadoop will be demonstrated using a VMWare virtual machine
      • I’ve found the use of a virtual machine to be the easiest way to get started with Hadoop
    • Curl installed
  • 4. Setup VM
    • Download VM from YDN
      • http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
    • Note:
      • user name: hadoop-user
      • password: hadoop
    • Launch vm
    • Log in
    • Note ip of machine
  • 5. Start Hadoop
    • Run the util to launch hadoop: $ ~/start-hadoop
    • If it's already running, we'll get an error like " datanode running as process 6752. Stop it first. secondarynamenode running as process 6845. Stop it first. ..."
  • 6. Saying “hi” to Hadoop
    • Call hadoop command line util: $ hadoop
    • Hadoop should have been launched on boot; verify this is the case: $ hadoop dfs -ls /
  • 7. Saying “hi” to Hadoop
    • If hadoop has not been started, you'll see something like: "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: / Already tried 0 time(s). 09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: / Already tried 1 time(s)...”
    • If hadoop has been launched, the dfs -ls command should show the contents of hdfs
    • Before continuing, view all the hadoop utilities and sample files: $ ls
  • 8. Install Apache
    • Why? In the interest of creating a relevant example, I'm going to work with Apache access logs
    • Update apt-get so it can find apache2: $ sudo apt-get update
    • Install apache2 so we can generate access log data: $ sudo apt-get install apache2
  • 9. Generate data
    • Jump into the directory containing the apache logs: $ cd /var/log/apache2
    • Follow the access log, starting from its last 10 lines: $ tail -f -n 10 access.log
  • 10. Generate data
    • Put this script, or something similar, in an executable file on your local machine
    • Edit the IP address to that of your VM
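The script itself appears only as a screenshot in the original deck. Below is a minimal sketch of what it might look like, assuming curl (listed in the prerequisites); the IP address, request paths, and loop counts are placeholders, not values from the deck:

```shell
#!/bin/sh
# generate.sh: send a burst of HTTP requests to the VM's Apache server
# so access.log has data to analyze. Replace VM_IP with the address you
# noted when logging into the VM (192.168.1.10 is a placeholder).
VM_IP="192.168.1.10"
SENT=0
for i in 1 2 3; do
  for page in / /index.html /no-such-page; do
    # --max-time keeps the loop moving even if the server is unreachable
    curl -s --max-time 1 "http://$VM_IP$page" > /dev/null
    SENT=$((SENT + 1))
  done
done
echo "sent $SENT requests"
```

Requesting a mix of existing and missing pages gives the access log a variety of status codes to count.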
  • 11. Generate data
    • Set executable permissions on the file: $ chmod +x generate.sh
    • Run the file: $ ./generate.sh
    • Note log data in tail output in VM
  • 12. Exploring HDFS
    • Show home dir structure:
      • $ hadoop dfs -ls /user
    • Create a directory:
      • $ hadoop dfs -mkdir /user/foo
  • 13. Exploring HDFS
    • Attempt to re-create new dir and note error:
      • $ hadoop dfs -mkdir /user/foo
    • Create a destination directory using implicit path:
      • $ hadoop dfs -mkdir bar
    • Auto-create nested destination directories:
      • $ hadoop dfs -mkdir dir1/dir2/dir3
    • Remove dir:
      • $ hadoop dfs -rmr /user/foo
    • Remove multiple dirs at once:
      • $ hadoop dfs -rmr bar dir1
    • Try to re-remove dir and note error:
      • $ hadoop dfs -rmr bar
  • 14. Browse HDFS using web UI
    • Open http://{VM IP address}:50070 in a browser (the NameNode web UI, which can browse HDFS; the JobTracker UI is on port 50030)
  • 15. Import access log data
    • Load access log into hdfs:
      • $ hadoop dfs -put /var/log/apache2/access.log input/access.log
    • Verify it's in there:
      • $ hadoop dfs -ls input/access.log
    • View the contents:
      • $ hadoop dfs -cat input/access.log
  • 16. Count words in data using Hadoop Streaming
    • Hadoop Streaming refers to the ability to use an arbitrary language to define a job’s map and reduce processes
  • 17. Python wordcount mapper Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python#Map:_mapper.py
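The mapper appears only as a screenshot in the original deck. A minimal sketch in the spirit of Michael Noll's mapper.py (the helper name map_line is my own, added so the logic is testable):

```python
#!/usr/bin/env python
# mapper.py: emit "<word>\t1" for each whitespace-separated token on stdin.
# Hadoop Streaming passes input lines on stdin and collects key\tvalue
# pairs from stdout; the shuffle/sort groups pairs by key for the reducer.
import sys

def map_line(line):
    # One (word, 1) pair per token; empty/whitespace lines yield nothing.
    return [(word, 1) for word in line.strip().split()]

if __name__ == "__main__":
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```

Make the file executable (chmod +x mapper.py) so the shell pipeline on slide 19 can invoke it.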
  • 18. Python wordcount reducer Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python#Map:_mapper.py
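The reducer likewise appears only as a screenshot. A minimal sketch in the same spirit (the helper name reduce_sorted is my own); it relies on Hadoop Streaming delivering the mapper output sorted by key:

```python
#!/usr/bin/env python
# reducer.py: sum the counts for each word. Input lines look like
# "<word>\t<count>" and must arrive sorted by word, as Hadoop Streaming
# (or `sort` in a local test run) guarantees.
import sys

def reduce_sorted(lines):
    results = []
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.strip().split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            # Key changed: flush the previous word's total.
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results

if __name__ == "__main__":
    for word, total in reduce_sorted(sys.stdin):
        print("%s\t%d" % (word, total))
```

Flushing a total only when the key changes is what makes the sorted-input requirement essential.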
  • 19. Test run mapper and reducer
    • $ cat data | ./mapper.py | sort | ./reducer.py
  • 20. Run Hadoop
    • Stream data through these two files, saving the output back to HDFS: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input input/access.log -output output/mapReduceOut -mapper /home/{username}/mapper.py -reducer /home/{username}/reducer.py
  • 21. View output
    • View output files:
      • $ hadoop dfs -ls output/mapReduceOut
    • Note multiple output files ("part-00000", "part-00001", etc)
    • View output file contents:
      • $ hadoop dfs -cat output/mapReduceOut/part-00000
  • 22. Pig
    • Pig is a higher-level interface to Hadoop
      • Interactive shell Grunt
      • Declarative, SQL-like language, Pig Latin
      • Pig engine compiles Pig Latin into MapReduce
      • Extensible via Java files
    • "Writing MapReduce routines is like coding in assembly"
    • Higher-level alternatives: Pig, Hive, etc.
  • 23. Exploring Pig
    • Pig is already on the VM
    • Launch pig w/ connection to cluster:
      • $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
    • View contents of HDFS from grunt:
      • > ls
  • 24. Pig word count
    • Save this script in a file, e.g., wordcount.pig:
    • myinput = LOAD 'input/access.log' USING TextLoader();
    • words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
    • grouped = GROUP words BY $0;
    • counts = FOREACH grouped GENERATE group, COUNT(words);
    • ordered = ORDER counts BY $0;
    • STORE ordered INTO 'output/pigOut' USING PigStorage();
  • 25. Perform word count w/ Pig
    • Run this script: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig
    • View output $ hadoop dfs -cat output/pigOut/part-00000
  • 26. Resources
    • Apache Hadoop Site
      • hadoop.apache.org
    • Apache Pig Site
      • hadoop.apache.org/pig/
    • YDN Hadoop Tutorial
      • developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
    • Michael G Noll’s tutorial:
      • www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
  • 27. Thank you
    • Follow me on Twitter:
      • http://twitter.com/erikeldridge
    • Find these slides on Slideshare:
      • http://slideshare.net/erikeldridge
    • Feedback? Suggestions?
      • http://speakerrate.com/erikeldridge