Hands-on Hadoop: Using the Yahoo! Hack U cluster
Published in: Technology, Business
  1. Hands on Hadoop: Using a shared EC2 cluster
     Erik Eldridge, Engineer/Evangelist, Yahoo! Developer Network
  2. Prerequisites
     • SSH installed
     • Familiarity with the Linux command line
     • Familiarity with Hadoop
     • The following from us:
       • Username
       • SSH key
       • Cluster IP
  3. Goals
     • Gain appreciation of speedup from multiple nodes
     • Gain experience with parallel processing
     • Gain experience with realistic data
  4. Connecting to the Cluster
     • $ ssh -i {private key file} {username}@{cluster ip}
  5. Viewing Yahoo! data
     • $ hadoop fs -ls /data/ydata
  6. Running a job
     • $ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
           -input /data/ydata/ydata-ysm-keyphrase-bid-imp-click-v1_0 \
           -output /user/{username}/output \
           -mapper /home/{username}/mapper \
           -reducer /home/{username}/reducer
  7. Viewing output
     • $ hadoop fs -ls /user/{username}/output
     • $ hadoop fs -cat /user/{username}/output/part-*
  8. Summary
     • Working on a multi-node cluster is very similar to working on a single node
     • The main difference is the performance gain
  9. Resources
     • Apache’s Hadoop site: hadoop.apache.org
     • Cloudera’s tutorial and scripts: archive.cloudera.com/docs/ec2.html
     • Amazon’s EC2 documentation: http://aws.amazon.com/ec2/
  10. Thank you
     • Follow me on Twitter: http://twitter.com/erikeldridge
     • Find these slides on Slideshare: http://slideshare.net/erikeldridge
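
The "Running a job" slide passes a mapper and a reducer executable to hadoop-streaming.jar, but the deck does not show what they contain. As a minimal sketch, assuming tab-separated input records and a simple count-per-key task, the two stages might look like this in Python. The names map_lines, reduce_lines, run_locally, and key_field are hypothetical, and the record layout of the Yahoo! dataset is not taken from the slides. Hadoop Streaming feeds each stage lines on standard input and collects lines from standard output, and the reducer receives its input sorted by key; run_locally imitates that flow on a single machine.

```python
from itertools import groupby


def map_lines(lines, key_field=0):
    """Map stage: emit 'key<TAB>1' for every non-empty input record.

    key_field is a hypothetical parameter: the index of the tab-separated
    column to count by. The real dataset layout is an assumption here.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if key_field < len(fields):
            yield f"{fields[key_field]}\t1"


def reduce_lines(sorted_lines):
    """Reduce stage: sum the counts of consecutive lines sharing a key.

    Relies on the input being sorted by key, which Hadoop Streaming
    guarantees for the reducer's input.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Imitate map -> sort -> reduce on one machine for quick testing."""
    return list(reduce_lines(sorted(map_lines(lines))))


sample = ["2008\tfoo\n", "2009\tbar\n", "2008\tbaz\n", "\n"]
print(run_locally(sample))
```

On the cluster, the mapper and reducer files would be small executable scripts that apply map_lines or reduce_lines to sys.stdin and print each resulting line. Testing the same functions locally first is one way to see the point made in the Summary slide: the code is the same on one node or many.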

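
The "Viewing output" slide prints the raw job output, which can be long. As a rough sketch, assuming each output line is a key and an integer count separated by a tab (tab is the default Hadoop Streaming key/value separator, but the count format depends on the reducer and is an assumption here), the largest entries could be picked out like this. The function name top_counts and its parameter n are hypothetical.

```python
import heapq


def top_counts(lines, n=3):
    """Return the n (key, count) pairs with the largest counts.

    Lines without a tab or with a non-integer value are skipped
    rather than treated as errors.
    """
    rows = []
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("\t")
        if not sep:
            continue
        try:
            rows.append((int(value), key))
        except ValueError:
            continue
    return [(key, count) for count, key in heapq.nlargest(n, rows)]


sample_output = ["a\t3\n", "b\t10\n", "malformed line\n", "c\t7\n"]
print(top_counts(sample_output, n=2))
```

The lines could come from the text printed by the hadoop fs -cat command on the "Viewing output" slide, for example by saving it to a local file and iterating over that file.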