Hadoop high-level intro - U. of Mich. Hack U '09


    Hadoop high-level intro - U. of Mich. Hack U '09 - Presentation Transcript

    1. Hadoop: A (very) high-level overview. University of Michigan Hack U ’08. Erik Eldridge, Yahoo! Developer Network. Photo credit: Swami Stream (http://ow.ly/17tC)
    2. Overview
      • What is it?
      • Example 1: word count
      • Example 2: search suggestions
      • Why would I use it?
      • How do I use it?
      • Some Code
    3. Before I continue…
      • Slides are available here: slideshare.net/erikeldridge
    4. Hadoop is
      • Software for breaking a big job into smaller tasks, performing each task, and collecting the results
    5. Example 1: Counting Words
      “Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
      • Split into 3 sentences
      • Count words in each sentence
        • 1 “Mary”, 1 “had”, 1 “a”, …
        • 1 “Its”, 1 “fleece”, 1 “was”, …
        • 1 “Everywhere”, 1 “that”, 1 “Mary”, …
      • Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …
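The split, count, and collect steps of this example can be sketched in plain Java. This is a minimal single-machine illustration, not Hadoop code: the class and method names (`SentenceWordCount`, `splitSentences`, `countWords`, `collect`, `countText`) are made up for this sketch, and splitting on `.`, `!`, and `?` is an assumed simplification of what counts as a sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration only; it does not use Hadoop.
class SentenceWordCount {

    // Step 1: split the big job (the text) into smaller tasks (sentences).
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.split("[.!?]+")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    // Step 2: perform one task, here counting the words of a single sentence.
    static Map<String, Integer> countWords(String sentence) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : sentence.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Step 3: collect the per-sentence counts into one total.
    static Map<String, Integer> collect(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    // Runs all three steps in order.
    static Map<String, Integer> countText(String text) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            partials.add(countWords(sentence));
        }
        return collect(partials);
    }

    public static void main(String[] args) {
        String text = "Mary had a little lamb. Its fleece was white as snow. "
                + "Everywhere that Mary went the lamb was sure to go.";
        System.out.println(countText(text));
    }
}
```

In the collected result, “Mary” and “lamb” each total 2, matching the slide. The per-sentence counts never depend on each other, which is what lets a system like Hadoop run them on different machines.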
    6. Example 2: Search Suggestions
    7. Creating search suggestions
      • Gazillions of search queries in server log files
      • How many times was each word used?
      • Using Hadoop, we would:
        • Split up files
        • Count words in each
        • Sum word counts
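To make the three steps concrete, here is a rough sketch that counts query words across several log chunks and sums the results. Everything in it is an assumption for illustration: the chunks are small in-memory lists standing in for split log files, `QueryLogCount` and its methods are hypothetical names, and a Java parallel stream stands in for the cluster. It is not how Hadoop itself is invoked.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine stand-in for "split, count, sum".
class QueryLogCount {

    // Count the words in one chunk of the log (one "split").
    static Map<String, Long> countChunk(List<String> queries) {
        Map<String, Long> counts = new HashMap<>();
        for (String query : queries) {
            for (String word : query.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Combine two partial counts without modifying either input.
    static Map<String, Long> sum(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new TreeMap<>(a);
        b.forEach((word, count) -> out.merge(word, count, Long::sum));
        return out;
    }

    // Count every chunk independently (possibly in parallel), then sum.
    static Map<String, Long> countAll(List<List<String>> chunks) {
        return chunks.parallelStream()
                .map(QueryLogCount::countChunk)
                .reduce(new TreeMap<>(), QueryLogCount::sum);
    }

    public static void main(String[] args) {
        // Made-up sample queries, not real log data.
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("hadoop tutorial", "hadoop download"),
                Arrays.asList("pig latin", "Hadoop streaming"));
        System.out.println(countAll(chunks));
    }
}
```

Because `sum` is associative and an empty map is its identity, the partial counts can be combined in any grouping, which is the property that makes the work safe to spread across machines.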
    8. So, Hadoop is
      • A distributed batch processing infrastructure
      • Built to process "web-scale" data: terabytes, petabytes
      • Two components:
        • HDFS
        • MapReduce infrastructure
    9. HDFS
      • A distributed, fault-tolerant file system
      • It’s easier to move calculations than data
      • Hadoop will split the data for you
    10. MapReduce Infrastructure
      • Two steps:
        • Map
        • Reduce
      • Java, C, C++ APIs
      • Pig, Streaming
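The two steps can be modeled in a few lines of plain Java to show how data flows between them. This is only an in-memory sketch under assumptions: `MiniMapReduce` and its methods are hypothetical names, and the `shuffle` method imitates the grouping by key that the framework performs between map and reduce. None of this is the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory model of the map and reduce steps.
class MiniMapReduce {

    // Map: one input record in, zero or more (key, value) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Between the steps: group all values by key, sorted by key.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: one key and all of its values in, one result out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Wire the steps together: map every line, group, then reduce every key.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the lamb was", "the lamb")));
    }
}
```

The job author writes only `map` and `reduce`; splitting the input, grouping by key, and moving data between machines are the parts the infrastructure takes over.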
    11. Java Word Count: Mapper
      // credit: http://ow.ly/1bER
      public static class MapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input line; emits (word, 1) for every token.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }
    12. Java Word Count: Reducer
      // credit: http://ow.ly/1bER
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once per word with all of its counts; emits (word, total).
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
    13. Java Word Count: Running it
      // credit: http://ow.ly/1bER
      public class WordCount {
        ……
        public static void main(String[] args) throws IOException {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          // the keys are words (strings)
          conf.setOutputKeyClass(Text.class);
          // the values are counts (ints)
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(MapClass.class);
          conf.setReducerClass(Reduce.class);
          // input and output directories come from the command line
          // (later Hadoop releases use FileInputFormat.setInputPaths and
          // FileOutputFormat.setOutputPath instead)
          conf.setInputPath(new Path(args[0]));
          conf.setOutputPath(new Path(args[1]));
          JobClient.runJob(conf);
          ……
    14. Streaming Word Count
      // credit: http://ow.ly/1bER
      bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh
      streamingMapper.sh: /bin/sed -e 's| |\n|g' | /bin/grep .
      streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
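The reducer pipeline above only works because its input arrives sorted, so identical words sit on adjacent lines and `uniq -c` can count each run. Since Streaming accepts any program that reads standard input and writes standard output, the same logic could be written in Java. The sketch below is one possible version: `StreamingReducerSketch` and `reduceSorted` are hypothetical names, and the tab between word and count is an assumed output format.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical Java counterpart of the uniq -c | awk reducer.
class StreamingReducerSketch {

    // Input: one word per line, already sorted so equal words are adjacent.
    // Output: one "word<TAB>count" line per run of equal words.
    static List<String> reduceSorted(List<String> sortedLines) {
        List<String> out = new ArrayList<>();
        String current = null;
        long count = 0;
        for (String line : sortedLines) {
            if (line.equals(current)) {
                count++;
            } else {
                if (current != null) {
                    out.add(current + "\t" + count);
                }
                current = line;
                count = 1;
            }
        }
        if (current != null) {
            out.add(current + "\t" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        // Read every line from standard input, then print the reduced lines.
        List<String> lines = new BufferedReader(new InputStreamReader(System.in))
                .lines()
                .collect(Collectors.toList());
        for (String result : reduceSorted(lines)) {
            System.out.println(result);
        }
    }
}
```

This version buffers all input for simplicity; a reducer meant for large data would print each run as soon as it ends rather than holding the lines in memory.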
    15. Pig Word Count
      // credit: http://ow.ly/1bER
      input = LOAD 'in-dir' USING TextLoader();
      words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
      grouped = GROUP words BY $0;
      counts = FOREACH grouped GENERATE group, COUNT(words);
      STORE counts INTO 'out-dir';
    16. Beyond Word Count
      • Yahoo! Search
        • Generating their Web Map
      • Zattoo
        • Computing viewership stats
      • New York Times
        • Converting their archives to PDF
      • Last.fm
        • Improving their streams by learning from track skipping patterns
      • Facebook
        • Indexing mail accounts
    17. Why use Hadoop?
      • Do you have a very large data set?
      • Hadoop works with cheap hardware
      • Simplified programming model
    18. How do I use it?
      • Download Hadoop
      • Define cluster in Hadoop settings
      • Import data using Hadoop
      • Define job using API, Pig, or streaming
      • Run job
      • Output is saved to file(s)
      • Sign up for Hadoop mailing list
    19. Resources
      • Hadoop project site
      • Yahoo! Hadoop tutorial
      • Hadoop Word Count (PDF)
      • Owen O’Malley’s intro to Hadoop
      • Ruby Word Count example
      • Tutorial on Hadoop + EC2 + S3
      • Tutorial on single-node Hadoop
    20. Thank you!
      • [email_address]
      • Twitter: erikeldridge
      • Presentation is available here: slideshare.net/erikeldridge