Hadoop high-level intro - U. of Mich. Hack U '09

This is a very high-level introduction to Hadoop, delivered to the Information Retrieval class at the University of Michigan during Hack U week '09.

Published in: Technology, Education
  • Transcript

    • 1. Hadoop: A (very) high-level overview University of Michigan Hack U ’09 Erik Eldridge Yahoo! Developer Network Photo credit: Swami Stream (http://ow.ly/17tC)
    • 2. Overview
      • What is it?
      • Example 1: word count
      • Example 2: search suggestions
      • Why would I use it?
      • How do I use it?
      • Some Code
    • 3. Before I continue…
      • Slides are available here: slideshare.net/erikeldridge
    • 4. Hadoop is
      • Software for breaking a big job into smaller tasks, performing each task, and collecting the results
    • 5. Example 1: Counting Words
      • Split into 3 sentences
      • Count words in each sentence
        • 1 “Mary”, 1 “had”, 1 “a”, …
        • 1 “Its”, 1 “fleece”, 1 “was”, …
        • 1 “Everywhere”, 1 “that”, 1 “Mary”, …
      • Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …
      “Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
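
      Not from the original deck: a minimal plain-Java sketch of the same split / count / merge idea, with no Hadoop involved (the real MapReduce version appears on the later word-count slides). The class name and the cleanup regex are illustrative assumptions.

        import java.util.HashMap;
        import java.util.Map;

        public class LocalWordCount {
          public static void main(String[] args) {
            String[] sentences = {
              "Mary had a little lamb.",
              "Its fleece was white as snow.",
              "Everywhere that Mary went the lamb was sure to go."
            };
            Map<String, Integer> totals = new HashMap<String, Integer>();
            for (String sentence : sentences) {
              // "map" step: count words within one sentence
              Map<String, Integer> counts = new HashMap<String, Integer>();
              for (String word : sentence.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+")) {
                if (word.isEmpty()) continue;
                Integer n = counts.get(word);
                counts.put(word, n == null ? 1 : n + 1);
              }
              // "reduce" step: merge the per-sentence counts into the totals
              for (Map.Entry<String, Integer> e : counts.entrySet()) {
                Integer n = totals.get(e.getKey());
                totals.put(e.getKey(), n == null ? e.getValue() : n + e.getValue());
              }
            }
            System.out.println(totals);  // e.g. mary=2, lamb=2, had=1, ...
          }
        }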
    • 6. Example 2: Search Suggestions
    • 7. Creating search suggestions
      • Gazillions of search queries in server log files
      • How many times was each word used?
      • Using Hadoop, we would:
        • Split up files
        • Count words in each
        • Sum word counts
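
      Again not from the deck: a hedged sketch of what the "count words in each" step might look like for query logs, reusing the old-API mapper pattern shown on the later word-count slides. The QueryLogMapper name and the tab-separated log format are assumptions; the "sum word counts" step would use the same summing reducer as the word-count example.

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        // Assumed (hypothetical) log format: timestamp <TAB> user <TAB> query
        public class QueryLogMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text term = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) return;  // skip malformed lines
            for (String word : fields[2].toLowerCase().split("\\s+")) {
              if (word.length() > 0) {
                term.set(word);
                output.collect(term, one);  // same (word, 1) pattern as word count
              }
            }
          }
        }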
    • 8. So, Hadoop is
      • A distributed batch processing infrastructure
      • Built to process "web-scale" data: terabytes, petabytes
      • Two components:
        • HDFS
        • MapReduce infrastructure
    • 9. HDFS
      • A distributed, fault-tolerant file system
      • It’s easier to move calculations than data
      • Hadoop will split the data for you
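
      A hedged sketch (not from the deck) of putting data into HDFS from Java via the org.apache.hadoop.fs API; the paths are made up for illustration. On the command line, bin/hadoop fs -put does the same thing.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsPut {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads the cluster settings from the classpath
            FileSystem fs = FileSystem.get(conf);      // the configured file system (HDFS on a cluster)
            // copy a local file into HDFS; both paths are hypothetical
            fs.copyFromLocalFile(new Path("/tmp/queries.log"), new Path("/user/me/input/queries.log"));
            for (FileStatus status : fs.listStatus(new Path("/user/me/input"))) {
              System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
          }
        }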
    • 10. MapReduce Infrastructure
      • Two steps:
        • Map
        • Reduce
      • Java, C, C++ APIs
      • Pig, Streaming
    • 11. Java Word Count: Mapper
        // credit: http://ow.ly/1bER
        public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();
          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            // tokenize the line and emit a (word, 1) pair for each token
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              output.collect(word, one);
            }
          }
        }
    • 12. Java Word Count: Reducer
        // credit: http://ow.ly/1bER
        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
            // the framework has already grouped the (word, 1) pairs by word,
            // so values iterates over all the 1s for this key
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
    • 13. Java Word Count: Running it
        // credit: http://ow.ly/1bER
        public class WordCount {
          ...
          public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            // the keys are words (strings)
            conf.setOutputKeyClass(Text.class);
            // the values are counts (ints)
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(MapClass.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            JobClient.runJob(conf);
          ...
    • 14. Streaming Word Count
        // credit: http://ow.ly/1bER
        bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh
        // mapper: put each word on its own line, dropping blank lines
        streamingMapper.sh: /bin/sed -e 's| |\n|g' | /bin/grep .
        // reducer: count adjacent repeats of the (already sorted) words, then print "word count"
        streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 " " $1}'
    • 15. Pig Word Count
        -- credit: http://ow.ly/1bER
        input = LOAD 'in-dir' USING TextLoader();              -- one record per line of text
        words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));   -- split each line into words
        grouped = GROUP words BY $0;
        counts = FOREACH grouped GENERATE group, COUNT(words); -- count occurrences per word
        STORE counts INTO 'out-dir';
    • 16. Beyond Word Count
      • Yahoo! Search
        • Generating their Web Map
      • Zattoo
        • Computing viewership stats
      • New York Times
        • Converting their archives to PDF
      • Last.fm
        • Improving their streams by learning from track skipping patterns
      • Facebook
        • Indexing mail accounts
    • 17. Why use Hadoop?
      • Do you have a very large data set?
      • Hadoop works with cheap hardware
      • Simplified programming model
    • 18. How do I use it?
      • Download Hadoop
      • Define cluster in Hadoop settings
      • Import data using Hadoop
      • Define job using API, Pig, or streaming
      • Run job
      • Output is saved to file(s)
      • Sign up for Hadoop mailing list
    • 19. Resources
      • Hadoop project site
      • Yahoo! Hadoop tutorial
      • Hadoop Word Count (PDF)
      • Owen O’Malley’s intro to Hadoop
      • Ruby Word Count example
      • Tutorial on Hadoop + EC2 + S3
      • Tutorial on single-node Hadoop
    • 20. Thank you!
      • [email_address]
      • Twitter: erikeldridge
      • Presentation is available here: slideshare.net/erikeldridge
