Hands-on Hadoop: An intro for Web developers

Hadoop is a powerful tool for performing computation on large amounts of data using multiple computers. Getting started with Hadoop, however, is very easy. The simplest introduction uses a virtual machine (VM) available for free from the Yahoo! Developer Network. In "Hands-on Hadoop: An intro for Web developers", we explore Hadoop using this VM. Data for the examples comes from Apache log files, generated by a simple process described in the presentation. Server log analysis is a natural use case for Hadoop and should, it is hoped, convey the utility of Hadoop to most Web developers. The Yahoo! VM is available here: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup

Transcript

  • 1. Hands-on Hadoop: An intro for Web developers. Erik Eldridge, Engineer/Evangelist, Yahoo! Developer Network. Photo credit: http://www.flickr.com/photos/exfordy/429414926/sizes/l/
  • 2. Goals
    • Gain familiarity with Hadoop
    • Approach Hadoop from a web dev's perspective
  • 3. Prerequisites
    • VMware
      • Hadoop will be demonstrated using a VMware virtual machine
      • I’ve found the use of a virtual machine to be the easiest way to get started with Hadoop
    • curl installed
  • 4. Set up the VM
    • Download the VM from YDN
      • http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
    • Note:
      • user name: hadoop-user
      • password: hadoop
    • Launch the VM
    • Log in
    • Note the IP address of the VM
  • 5. Start Hadoop
    • Run the util to launch Hadoop: $ ~/start-hadoop
    • If it's already running, we'll get an error like "172.16.83.132: datanode running as process 6752. Stop it first. 172.16.83.132: secondarynamenode running as process 6845. Stop it first. ..."
  • 6. Saying “hi” to Hadoop
    • Call the hadoop command-line util: $ hadoop
    • Hadoop should have been launched on boot. Verify this is the case: $ hadoop dfs -ls /
  • 7. Saying “hi” to Hadoop
    • If Hadoop has not been started, you'll see something like: "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s). 09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s)...”
    • If Hadoop has been launched, the dfs -ls command should show the contents of HDFS
    • Before continuing, view all the Hadoop utilities and sample files: $ ls
  • 8. Install Apache
    • Why? In the interest of creating a relevant example, I'm going to work on Apache access logs
    • Update apt-get so it can find apache2: $ sudo apt-get update
    • Install apache2 so we can generate access log data: $ sudo apt-get install apache2
  • 9. Generate data
    • Jump into the directory containing the Apache logs: $ cd /var/log/apache2
    • Follow the last 10 lines of the access log: $ tail -f -n 10 access.log
  • 10. Generate data
    • Put a script along the lines of the sketch below in an executable file, e.g., generate.sh, on your local machine
    • Edit the IP address to that of your VM
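    The script itself appears only as an image in the original deck. A minimal sketch of what generate.sh might contain, assuming the VM's Apache listens on port 80 and using a placeholder IP address:

      #!/bin/sh
      # generate.sh: request pages from the VM's Apache server to
      # produce access-log entries. Replace the address with your VM's IP.
      VM_IP=172.16.83.132
      for i in $(seq 1 100); do
        curl -s "http://$VM_IP/?page=$i" > /dev/null
      done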
  • 11. Generate data
    • Set executable permissions on the file: $ chmod +x generate.sh
    • Run the file: $ ./generate.sh
    • Note the new log entries appearing in the tail output on the VM
  • 12. Exploring HDFS
    • Show home dir structure:
      • $ hadoop dfs -ls /user
    • Create a directory:
      • $ hadoop dfs -mkdir /user/foo
  • 13. Exploring HDFS
    • Attempt to re-create new dir and note error:
      • $ hadoop dfs -mkdir /user/foo
    • Create a destination directory using implicit path:
      • $ hadoop dfs -mkdir bar
    • Auto-create nested destination directories:
      • $ hadoop dfs -mkdir dir1/dir2/dir3
    • Remove dir:
      • $ hadoop dfs -rmr /user/foo
    • Remove multiple dirs at once:
      • $ hadoop dfs -rmr bar dir1
    • Try to re-remove dir and note error:
      • $ hadoop dfs -rmr bar
  • 14. Browse HDFS using web UI
    • Open http://{VM IP address}:50070 in a browser (the NameNode UI, which lets you browse HDFS; port 50030 serves the JobTracker UI)
  • 15. Import access log data
    • Load access log into hdfs:
      • $ hadoop dfs -put /var/log/apache2/access.log input/access.log
    • Verify it's in there:
      • $ hadoop dfs -ls input/access.log
    • View the contents:
      • $ hadoop dfs -cat input/access.log
  • 16. Count words in data using Hadoop Streaming
    • Hadoop Streaming refers to the ability to define a job’s map and reduce processes in an arbitrary language, as executables that read records on stdin and write key/value lines on stdout
  • 17. Python wordcount mapper (credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python#Map:_mapper.py)
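    The mapper code appears only as an image in the original deck. A minimal sketch in the spirit of Noll's tutorial:

      #!/usr/bin/env python
      # mapper.py: read raw lines on stdin and emit one "word<TAB>1"
      # pair per word on stdout.
      import sys

      for line in sys.stdin:
          for word in line.strip().split():
              print('%s\t1' % word)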
  • 18. Python wordcount reducer (credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python#Map:_mapper.py)
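    The reducer code is likewise an image in the original deck. A minimal sketch, relying on streaming's guarantee that mapper output reaches the reducer sorted by key, so identical words arrive adjacent:

      #!/usr/bin/env python
      # reducer.py: sum the counts for each word. Input lines look like
      # "word<TAB>count" and arrive sorted by word.
      import sys

      current_word = None
      current_count = 0

      for line in sys.stdin:
          word, count = line.strip().split('\t', 1)
          if word == current_word:
              current_count += int(count)
          else:
              if current_word is not None:
                  print('%s\t%d' % (current_word, current_count))
              current_word = word
              current_count = int(count)

      if current_word is not None:
          print('%s\t%d' % (current_word, current_count))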
  • 19. Test run mapper and reducer
    • $ cat data | ./mapper.py | sort | ./reducer.py (make both scripts executable first: $ chmod +x mapper.py reducer.py)
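    A quick sanity check, using made-up input on the command line:

      $ printf 'foo bar\nfoo\n' | ./mapper.py | sort | ./reducer.py
      bar	1
      foo	2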
  • 20. Run Hadoop
    • Stream the access log through these two files, saving the output back to HDFS: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input input/access.log -output output/mapReduceOut -mapper /home/{username}/mapper.py -reducer /home/{username}/reducer.py
  • 21. View output
    • View output files:
      • $ hadoop dfs -ls output/mapReduceOut
    • Note multiple output files ("part-00000", "part-00001", etc)
    • View output file contents:
      • $ hadoop dfs -cat output/mapReduceOut/part-00000
  • 22. Pig
    • Pig is a higher-level interface for Hadoop
      • Interactive shell, Grunt
      • Declarative, SQL-like language, Pig Latin
      • Pig engine compiles Pig Latin into MapReduce
      • Extensible via Java files
    • “Writing MapReduce routines is like coding in assembly”
    • Higher-level tools like Pig and Hive address this
  • 23. Exploring Pig
    • Pig is already on the VM
    • Launch Pig with a connection to the cluster:
      • $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
    • View the contents of HDFS from Grunt:
      • grunt> ls
  • 24. Pig word count
    • Save this script in a file, e.g., wordcount.pig:

      myinput = LOAD 'input/access.log' USING TextLoader();
      words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
      grouped = GROUP words BY $0;
      counts = FOREACH grouped GENERATE group, COUNT(words);
      ordered = ORDER counts BY $0;
      STORE ordered INTO 'output/pigOut' USING PigStorage();
  • 25. Perform word count w/ Pig
    • Run this script: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig
    • View the output: $ hadoop dfs -cat output/pigOut/part-00000
  • 26. Resources
    • Apache Hadoop Site
      • hadoop.apache.org
    • Apache Pig Site
      • hadoop.apache.org/pig/
    • YDN Hadoop Tutorial
      • developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
    • Michael G Noll’s tutorial:
      • www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
  • 27. Thank you
    • Follow me on Twitter:
      • http://twitter.com/erikeldridge
    • Find these slides on Slideshare:
      • http://slideshare.net/erikeldridge
    • Feedback? Suggestions?
      • http://speakerrate.com/erikeldridge
