Hands-on Hadoop: Using the Yahoo! Hack U cluster

Usage Rights: CC Attribution-ShareAlike License

    Presentation Transcript

    • Hands on Hadoop: Using a shared EC2 cluster
      • Erik Eldridge
      • Engineer/Evangelist, Yahoo! Developer Network
    • Prerequisites
      • SSH installed
      • Familiarity with the Linux command line
      • Familiarity with Hadoop
      • The following from us:
        • Username
        • SSH key
        • Cluster IP
    • Goals
      • Gain appreciation of speedup from multiple nodes
      • Gain experience with parallel processing
      • Gain experience with realistic data
    • Connecting to the Cluster
      • $ ssh -i {private key file} {username}@{cluster ip}
    • Viewing Yahoo! data
      • $ hadoop fs -ls /data/ydata
    • Running a job
      • $ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input /data/ydata/ydata-ysm-keyphrase-bid-imp-click-v1_0 -output /user/{username}/output -mapper /home/{username}/mapper -reducer /home/{username}/reducer
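
The streaming job above runs two executables, a mapper and a reducer, that read lines on standard input and write tab-separated key/value lines on standard output. The slides do not show them, so the following is only a minimal sketch of what such a pair might look like, with a local dry run that imitates the shuffle step using `sort`. The `mapper`, `reducer`, and `sample_input` functions and the sample rows are hypothetical; the real layout of the ydata file is not assumed here.

```shell
#!/bin/sh
# Hypothetical mapper: emit "<first field><TAB>1" for each tab-separated input line.
mapper() {
    awk -F '\t' '{ print $1 "\t" 1 }'
}

# Hypothetical reducer: input arrives sorted by key, so equal keys are adjacent.
# Sum the counts for each run of identical keys.
reducer() {
    awk -F '\t' '
        $1 != key { if (NR > 1) print key "\t" total; key = $1; total = 0 }
        { total += $2 }
        END { if (NR > 0) print key "\t" total }
    '
}

# Made-up sample rows, not taken from the Yahoo! data set.
sample_input() {
    printf 'apple\t10\nbanana\t3\napple\t7\ncherry\t1\nbanana\t2\n'
}

# Local dry run: sort stands in for the shuffle between map and reduce.
result=$(sample_input | mapper | sort | reducer)
printf '%s\n' "$result"
```

To use this on the cluster, each function body would be saved as its own executable script (for example at /home/{username}/mapper and /home/{username}/reducer, matching the command above). Running the pipeline locally on a small sample first is a cheap way to catch mistakes before submitting a job to the shared cluster.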
    • Viewing output
      • $ hadoop fs -ls /user/{username}/output
      • $ hadoop fs -cat /user/{username}/output/part-*
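
Job output can be long, so it can help to pipe it through a small filter rather than reading it raw. The sketch below assumes the reducer wrote tab-separated key/count lines. The `top_keys` helper and the sample rows are hypothetical and not part of the original slides.

```shell
#!/bin/sh
# Hypothetical helper: print the n lines with the largest numeric value
# in the second tab-separated column. Usage: top_keys n
top_keys() {
    sort -t "$(printf '\t')" -k2,2nr | head -n "$1"
}

# Made-up reducer output standing in for the real job output.
sample_output() {
    printf 'apple\t2\nbanana\t5\ncherry\t1\n'
}

top=$(sample_output | top_keys 2)
printf '%s\n' "$top"

# On the cluster, assuming the output sits in part files under the output directory:
#   hadoop fs -cat /user/{username}/output/part-* | top_keys 10
```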
    • Summary
      • Working on a multi-node cluster is very similar to working on a single node: the commands are the same
      • The main difference is the performance gain from running the job on multiple nodes in parallel
    • Resources
      • Apache’s Hadoop site:
        • hadoop.apache.org
      • Cloudera’s tutorial and scripts:
        • archive.cloudera.com/docs/ec2.html
      • Amazon’s EC2 documentation:
        • http://aws.amazon.com/ec2/
    • Thank you
      • Follow me on Twitter: http://twitter.com/erikeldridge
      • Find these slides on Slideshare: http://slideshare.net/erikeldridge