Hadoop Tutorial

1. Hands-On Hadoop Tutorial
   Chris Sosa and Wolfgang Richter
   May 23, 2008
2. General Information
   - Hadoop uses HDFS, a distributed file system based on GFS, as its shared filesystem
   - HDFS divides files into large chunks (~64 MB) distributed across data servers
   - HDFS has a global namespace
3. General Information (cont'd)
   - A script is provided for your convenience
     - Run source /localtmp/hadoop/setupVars from centurion064
     - It changes all uses of {somePath}/command to just command
   - Go to http://www.cs.virginia.edu/~cbs6n/hadoop for web access; these slides and more information are also available there
   - Once you use the DFS (put something in it), relative paths resolve from /usr/{your user id}, e.g. if your id is tb28, your "home dir" is /usr/tb28 (see the example below)
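A quick check of the relative-path behaviour just described; this is only a sketch, where notes.txt is a made-up local file and tb28 is the example user id from the slide:

   # Pick up the tutorial environment so plain "hadoop" works without a full path.
   source /localtmp/hadoop/setupVars

   # A bare relative path lands under your DFS "home dir" (/usr/tb28 for user tb28).
   hadoop dfs -put notes.txt notes.txt
   hadoop dfs -ls /usr/tb28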
4. Master Node
   - Hadoop is currently configured with centurion064 as the master node
   - The master node
     - Keeps track of the namespace and metadata about items
     - Keeps track of MapReduce jobs in the system
5. Slave Nodes
   - centurion064 also acts as a slave node
   - Slave nodes
     - Manage blocks of data sent from the master node
     - In terms of GFS, these are the chunkservers
   - centurion060 is currently the other slave node (a quick way to list live nodes is sketched below)
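To see which datanodes the master currently knows about, the standard dfsadmin report can be used; a minimal sketch, assuming setupVars has been sourced:

   # Prints DFS capacity/usage plus one entry per datanode the namenode sees.
   hadoop dfsadmin -report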
6. Hadoop Paths
   - Hadoop is locally "installed" on each machine
     - The installed location is /localtmp/hadoop/hadoop-0.15.3
     - Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is created automatically by the DFS)
     - /localtmp/hadoop is owned by group gbg (someone in this group, or a CS admin, must administer it)
   - Files are divided into 64 MB chunks (this is configurable)
7. Starting / Stopping Hadoop
   - For the purposes of this tutorial, we assume you have run setupVars as described earlier
   - start-all.sh starts all slave nodes and the master node
   - stop-all.sh stops all slave nodes and the master node
   (an example session follows)
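A minimal example session, assuming setupVars has put the Hadoop scripts on your PATH:

   # Bring up the master daemons plus a datanode/tasktracker on every host in conf/slaves.
   start-all.sh

   # ... use the DFS, run MapReduce jobs ...

   # Shut the whole cluster back down.
   stop-all.sh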
8. Using HDFS (1/2)
   hadoop dfs
     [-ls <path>]
     [-du <path>]
     [-cp <src> <dst>]
     [-rm <path>]
     [-put <localsrc> <dst>]
     [-copyFromLocal <localsrc> <dst>]
     [-moveFromLocal <localsrc> <dst>]
     [-get [-crc] <src> <localdst>]
     [-cat <src>]
     [-copyToLocal [-crc] <src> <localdst>]
     [-moveToLocal [-crc] <src> <localdst>]
     [-mkdir <path>]
     [-touchz <path>]
     [-test -[ezd] <path>]
     [-stat [format] <path>]
     [-help [cmd]]
   (a short worked example follows)
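A short worked example of the most common operations; the file and directory names (mydata.txt, input) are made up for illustration:

   hadoop dfs -mkdir input                  # relative paths resolve under your DFS home dir
   hadoop dfs -put mydata.txt input         # copy a local file into the DFS
   hadoop dfs -ls input                     # list the directory
   hadoop dfs -cat input/mydata.txt         # print the file's contents
   hadoop dfs -get input/mydata.txt /tmp/   # copy it back out to the local filesystem
   hadoop dfs -rm input/mydata.txt          # remove it from the DFS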
9. Using HDFS (2/2)
   - Want to reformat? Easy:
     - hadoop namenode -format
   - Basically, most commands look similar
     - hadoop <command> [options]
     - If you just type hadoop, you get a list of all possible commands (including undocumented ones, hooray)
   (a reformat sketch follows)
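A hedged sketch of a complete reformat; note that this erases everything stored in the DFS, so only do it on a cluster you are free to wipe:

   stop-all.sh                # take the cluster down first
   hadoop namenode -format    # reinitialize the namenode's storage (destroys all DFS data)
   start-all.sh               # bring the cluster back up, now empty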
10. To Add Another Slave
    - This adds another data node / job execution site to the pool
      - Hadoop dynamically uses the filesystem underneath it
      - If more space is available on the HDD, HDFS will try to use it when it needs to
    - Modify the slaves file in centurion064:/localtmp/hadoop/hadoop-0.15.3/conf
    - Copy the code installation dir to newMachine:/localtmp/hadoop/hadoop-0.15.3 (it is very small)
    - Restart Hadoop
    (see the sketch below)
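A hedged sketch of those three steps, run from the master node; newMachine is a placeholder hostname, not a real host from the tutorial:

    # 1. Register the new slave on the master.
    echo newMachine >> /localtmp/hadoop/hadoop-0.15.3/conf/slaves

    # 2. Copy the (small) installation to the same path on the new machine.
    scp -r /localtmp/hadoop/hadoop-0.15.3 newMachine:/localtmp/hadoop/

    # 3. Restart Hadoop so a datanode/tasktracker is started on the new slave.
    stop-all.sh
    start-all.sh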
11. Configure Hadoop
    - Configuration lives in {installation dir}/conf
      - hadoop-default.xml for global settings
      - hadoop-site.xml for site-specific settings (overrides the global file)
    (a sample override is sketched below)
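A minimal sketch of what conf/hadoop-site.xml might contain. The property names are standard ones from the Hadoop 0.15 series; the host and port values (centurion064:9000 and centurion064:9001) and the replication factor are illustrative assumptions, not the tutorial cluster's actual settings:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>      <!-- where the namenode listens; assumed host:port -->
        <value>centurion064:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>   <!-- where the jobtracker listens; assumed host:port -->
        <value>centurion064:9001</value>
      </property>
      <property>
        <name>dfs.replication</name>      <!-- number of copies kept of each block; assumed value -->
        <value>2</value>
      </property>
    </configuration>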
12. That's it for Configuration!

13. Real-time Access