Hadoop high-level intro - U. of Mich. Hack U '09

This is a very high-level introduction to Hadoop delivered to the Information Retrieval class at University of Michigan during the Hack U week '09.

Published in Technology, Education

Transcript

  • 1. Hadoop: A (very) high-level overview University of Michigan Hack U ’08 Erik Eldridge Yahoo! Developer Network Photo credit: Swami Stream (http://ow.ly/17tC)
  • 2. Overview
    • What is it?
    • Example 1: word count
    • Example 2: search suggestions
    • Why would I use it?
    • How do I use it?
    • Some Code
  • 3. Before I continue…
    • Slides are available here: slideshare.net/erikeldridge
  • 4. Hadoop is
    • Software for breaking a big job into smaller tasks, performing each task, and collecting the results
  • 5. Example 1: Counting Words
    • Split into 3 sentences
    • Count words in each sentence
      • 1 “Mary”, 1 “had”, 1 “a”, …
      • 1 “Its”, 1 “fleece”, 1 “was”, …
      • 1 “Everywhere”, 1 “that”, 1 “Mary”, …
    • Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …
    “Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
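The three steps on this slide can be sketched in plain Java without any Hadoop dependency. This is only an illustration: the class name `SentenceWordCount` and its methods are made up for this example, and splitting on periods is an assumption about how the rhyme is divided into sentences.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
class SentenceWordCount {
    // Count the words in one piece of text (the per-sentence task).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on anything that is not a letter or apostrophe, so "lamb." counts as "lamb".
        for (String token : text.split("[^A-Za-z']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Combine the per-sentence counts into one total per word.
    static Map<String, Integer> collect(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> totals.merge(word, n, Integer::sum));
        }
        return totals;
    }

    // Split into sentences, count each one, then collect the results.
    static Map<String, Integer> run(String rhyme) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String sentence : rhyme.split("\\.")) {
            partials.add(countWords(sentence));
        }
        return collect(partials);
    }

    public static void main(String[] args) {
        String rhyme = "Mary had a little lamb. Its fleece was white as snow. "
                + "Everywhere that Mary went the lamb was sure to go.";
        System.out.println(run(rhyme));
    }
}
```

Each call to `countWords` is independent of the others, which is what would let a system like Hadoop hand the sentences to different machines.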
  • 6. Example 2: Search Suggestions
  • 7. Creating search suggestions
    • Gazillions of search queries in server log files
    • How many times was each word used?
    • Using Hadoop, we would:
      • Split up files
      • Count words in each
      • Sum word counts
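As a rough single-machine sketch of the same idea, the Java below counts words across several in-memory "log files" and ranks them by frequency. The class `QueryWordStats`, the representation of a log file as a list of query strings, and the top-N ranking step are all assumptions made for illustration; the slides only say that word usage is counted.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Hadoop.
class QueryWordStats {
    // Count words across all log files. The parallel stream lets different
    // files be processed on different threads; the grouping collector then
    // sums the counts per word.
    static Map<String, Long> countWords(List<List<String>> logFiles) {
        return logFiles.parallelStream()
                .flatMap(List::stream)                 // every query in every file
                .flatMap(q -> Arrays.stream(q.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Most frequent words first; ties are broken alphabetically for stable output.
    static List<String> topWords(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> logs = Arrays.asList(
                Arrays.asList("hadoop tutorial", "pig latin"),
                Arrays.asList("hadoop streaming", "Hadoop pig"));
        System.out.println(topWords(countWords(logs), 2));
    }
}
```

On a real cluster the files would live on many machines rather than in one list, but the shape of the computation (count per file, then sum) is the same.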
  • 8. So, Hadoop is
    • A distributed batch processing infrastructure
    • Built to process "web-scale" data: terabytes, petabytes
    • Two components:
      • HDFS
      • MapReduce infrastructure
  • 9. HDFS
    • A distributed, fault-tolerant file system
    • It’s easier to move calculations than data
    • Hadoop will split the data for you
  • 10. MapReduce Infrastructure
    • Two steps:
      • Map
      • Reduce
    • Java, C, C++ APIs
    • Pig, Streaming
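To make the two steps concrete, here is a minimal in-memory sketch of the map and reduce pattern in plain Java. It is not the Hadoop API: `MiniMapReduce`, `Pair`, and `run` are hypothetical names, and the grouping of values by key between the two steps is done here with a simple sorted map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical illustration class; not part of Hadoop.
class MiniMapReduce {
    // A key/value pair emitted by the map step.
    static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Pair<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map: each input record produces zero or more key/value pairs,
        // which are grouped by key as they are emitted.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            for (Pair<K, V> pair : mapper.apply(input)) {
                grouped.computeIfAbsent(pair.key, k -> new ArrayList<>()).add(pair.value);
            }
        }
        // Reduce: one call per key, with all of that key's values.
        Map<K, R> results = new TreeMap<>();
        grouped.forEach((key, values) -> results.put(key, reducer.apply(key, values)));
        return results;
    }

    // Word count expressed as a mapper and a reducer.
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
                lines,
                line -> {
                    List<Pair<String, Integer>> out = new ArrayList<>();
                    for (String w : line.split("\\s+")) {
                        if (!w.isEmpty()) out.add(new Pair<>(w, 1)); // emit (word, 1)
                    }
                    return out;
                },
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("the lamb was sure");
        lines.add("the lamb");
        System.out.println(wordCount(lines));
    }
}
```

The mapper and reducer never share state, which is the property that lets a framework run many copies of each on separate machines.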
  • 11. Java Word Count: Mapper
    • //credit: http://ow.ly/1bER
    • public static class MapClass extends MapReduceBase
    • implements Mapper<LongWritable, Text, Text, IntWritable> {
    • private final static IntWritable one = new IntWritable(1);
    • private Text word = new Text();
    • public void map(LongWritable key, Text value,
    • OutputCollector<Text, IntWritable> output,
    • Reporter reporter) throws IOException {
    • String line = value.toString();
    • StringTokenizer itr = new StringTokenizer(line);
    • while (itr.hasMoreTokens()) {
    • word.set(itr.nextToken());
    • output.collect(word, one);
    • }
    • }
    • }
  • 12. Java Word Count: Reducer
    • //credit: http://ow.ly/1bER
    • public static class Reduce extends MapReduceBase
    • implements Reducer<Text, IntWritable, Text, IntWritable> {
    • public void reduce(Text key, Iterator<IntWritable> values,
    • OutputCollector<Text, IntWritable> output,
    • Reporter reporter) throws IOException {
    • int sum = 0;
    • while (values.hasNext()) {
    • sum += values.next().get();
    • }
    • output.collect(key, new IntWritable(sum));
    • }
    • }
  • 13. Java Word Count: Running it
    • //credit: http://ow.ly/1bER
    • public class WordCount {
    • ……
    • public static void main(String[] args) throws IOException {
    • JobConf conf = new JobConf(WordCount.class);
    • conf.setJobName("wordcount");
    • // the keys are words (strings)
    • conf.setOutputKeyClass(Text.class);
    • // the values are counts (ints)
    • conf.setOutputValueClass(IntWritable.class);
    • conf.setMapperClass(MapClass.class);
    • conf.setReducerClass(Reduce.class);
    • conf.setInputPath(new Path(args[0]));
    • conf.setOutputPath(new Path(args[1]));
    • JobClient.runJob(conf);
    • … ..
  • 14. Streaming Word Count
    • //credit: http://ow.ly/1bER
    • bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh
    • streamingMapper.sh: /bin/sed -e 's| |\n|g' | /bin/grep .
    • streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 " " $1}'
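The streaming version relies on a contract worth spelling out: the mapper script is meant to put each word on its own line, the framework sorts the mapper output, and the reducer can then count runs of identical adjacent lines. The plain Java sketch below imitates that pipeline in memory. The class `StreamingSketch` and its method names are hypothetical, and the explicit `Collections.sort` call stands in for the sort performed between the two steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
class StreamingSketch {
    // Mapper: one word per output line, dropping empty entries
    // (the role played by sed and grep in the mapper script).
    static List<String> mapper(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.split(" ")) {
                if (!w.isEmpty()) words.add(w);
            }
        }
        return words;
    }

    // Reducer: input arrives sorted, so equal words are adjacent and can be
    // counted in a single pass (the role of uniq -c). Emits "word count".
    static List<String> reducer(List<String> sortedWords) {
        List<String> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String w : sortedWords) {
            if (w.equals(current)) {
                count++;
            } else {
                if (current != null) out.add(current + " " + count);
                current = w;
                count = 1;
            }
        }
        if (current != null) out.add(current + " " + count);
        return out;
    }

    static List<String> run(List<String> lines) {
        List<String> words = mapper(lines);
        Collections.sort(words); // stands in for the sort between map and reduce
        return reducer(words);
    }

    public static void main(String[] args) {
        run(Arrays.asList("b a", "a c a")).forEach(System.out::println);
    }
}
```

Without the sort, the single-pass reducer would emit the same word more than once, which is why `uniq -c` only works as a reducer on sorted input.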
  • 15. Pig Word Count
    • //credit: http://ow.ly/1bER
    • input = LOAD 'in-dir' USING TextLoader();
    • words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
    • grouped = GROUP words BY $0;
    • counts = FOREACH grouped GENERATE group, COUNT(words);
    • STORE counts INTO 'out-dir';
  • 16. Beyond Word Count
    • Yahoo! Search
      • Generating their Web Map
    • Zattoo
      • Computing viewership stats
    • New York Times
      • Converting their archives to PDF
    • Last.fm
      • Improving their streams by learning from track skipping patterns
    • Facebook
      • Indexing mail accounts
  • 17. Why use Hadoop?
    • Do you have a very large data set?
    • Hadoop works with cheap hardware
    • Simplified programming model
  • 18. How do I use it?
    • Download Hadoop
    • Define cluster in Hadoop settings
    • Import data using Hadoop
    • Define job using API, Pig, or streaming
    • Run job
    • Output is saved to file(s)
    • Sign up for Hadoop mailing list
  • 19. Resources
    • Hadoop project site
    • Yahoo! Hadoop tutorial
    • Hadoop Word Count ( pdf )
    • Owen O’Malley’s intro to Hadoop
    • Ruby Word Count example
    • Tutorial on Hadoop + EC2 + S3
    • Tutorial on single-node Hadoop
  • 20. Thank you!
    • [email_address]
    • Twitter: erikeldridge
    • Presentation is available here: slideshare.net/erikeldridge