Hadoop high-level intro - U. of Mich. Hack U '09
This is a very high-level introduction to Hadoop, delivered to the Information Retrieval class at the University of Michigan during Hack U week '09.


1. Hadoop: A (very) high-level overview
   University of Michigan Hack U '08
   Erik Eldridge, Yahoo! Developer Network
   Photo credit: Swami Stream (http://ow.ly/17tC)
2. Overview
   - What is it?
   - Example 1: word count
   - Example 2: search suggestions
   - Why would I use it?
   - How do I use it?
   - Some code
3. Before I continue...
   - Slides are available here: slideshare.net/erikeldridge
4. Hadoop is
   - Software for breaking a big job into smaller tasks, performing each task, and collecting the results
5. Example 1: Counting Words
   "Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go."
   - Split into 3 sentences
   - Count words in each sentence
     - 1 "Mary", 1 "had", 1 "a", ...
     - 1 "Its", 1 "fleece", 1 "was", ...
     - 1 "Everywhere", 1 "that", 1 "Mary", ...
   - Collect results: 2 "Mary", 1 "had", 1 "a", 1 "little", 2 "lamb", ...
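The split/count/collect steps above can be sketched in plain Java. This is a stand-alone illustration, not Hadoop code; the class and method names here are invented for the example, and trailing punctuation is stripped so "lamb." and "lamb" count as the same word, matching the slide's totals.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Stand-alone sketch of split / count / collect (not Hadoop code).
public class TinyWordCount {

    // "Count words in each sentence": tally one sentence's words,
    // dropping trailing punctuation so "lamb." and "lamb" match.
    static Map<String, Integer> countSentence(String sentence) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringTokenizer itr = new StringTokenizer(sentence);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken().replaceAll("[.!?,]", "");
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // "Collect results": merge the per-sentence tallies into one total.
    static Map<String, Integer> collect(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new LinkedHashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // "Split into 3 sentences" is done by hand here.
        Map<String, Integer> total = collect(List.of(
            countSentence("Mary had a little lamb."),
            countSentence("Its fleece was white as snow."),
            countSentence("Everywhere that Mary went the lamb was sure to go.")));
        System.out.println(total); // Mary=2, lamb=2, was=2, the rest 1
    }
}
```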
6. Example 2: Search Suggestions
7. Creating search suggestions
   - Gazillions of search queries in server log files
   - How many times was each word used?
   - Using Hadoop, we would:
     - Split up the files
     - Count the words in each
     - Sum the word counts
8. So, Hadoop is
   - A distributed batch processing infrastructure
   - Built to process "web-scale" data: terabytes, petabytes
   - Two components:
     - HDFS
     - MapReduce infrastructure
9. HDFS
   - A distributed, fault-tolerant file system
   - It's easier to move calculations than data
   - Hadoop will split the data for you
10. MapReduce Infrastructure
    - Two steps:
      - Map
      - Reduce
    - Java, C, C++ APIs
    - Pig, Streaming
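Before the real Hadoop API versions that follow, here is a hedged, stand-alone sketch of what the two steps do, in plain Java with no Hadoop classes (all names invented for illustration). It also shows the grouping-by-key ("shuffle") that the framework performs for you between map and reduce:

```java
import java.util.*;

// Plain-Java sketch of map -> shuffle -> reduce (no Hadoop classes).
public class MapReduceSketch {

    // Map: one input line -> a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle: group values by key. In Hadoop the framework does this
    // between the map and reduce phases; you never write it yourself.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: fold each key's list of values into one value (here, a sum).
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((key, values) ->
            out.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"to be or", "not to be"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(shuffle(pairs))); // {be=2, not=1, or=1, to=2}
    }
}
```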
11. Java Word Count: Mapper

        // credit: http://ow.ly/1bER
        public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              output.collect(word, one);
            }
          }
        }
12. Java Word Count: Reducer

        // credit: http://ow.ly/1bER
        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
13. Java Word Count: Running it

        // credit: http://ow.ly/1bER
        public class WordCount {
          ...
          public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            // the keys are words (strings)
            conf.setOutputKeyClass(Text.class);
            // the values are counts (ints)
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(MapClass.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            JobClient.runJob(conf);
          }
          ...
        }
14. Streaming Word Count

        # credit: http://ow.ly/1bER
        bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir \
            -mapper streamingMapper.sh -reducer streamingReducer.sh

        # streamingMapper.sh (one word per output line):
        /bin/sed -e 's| |\n|g' | /bin/grep .

        # streamingReducer.sh (input arrives sorted by key):
        /usr/bin/uniq -c | /bin/awk '{print $2 " " $1}'
15. Pig Word Count

        -- credit: http://ow.ly/1bER
        input = LOAD 'in-dir' USING TextLoader();
        words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
        grouped = GROUP words BY $0;
        counts = FOREACH grouped GENERATE group, COUNT(words);
        STORE counts INTO 'out-dir';
16. Beyond Word Count
    - Yahoo! Search
      - Generating their Web Map
    - Zattoo
      - Computing viewership stats
    - New York Times
      - Converting their archives to PDF
    - Last.fm
      - Improving their streams by learning from track-skipping patterns
    - Facebook
      - Indexing mail accounts
17. Why use Hadoop?
    - Do you have a very large data set?
    - Hadoop works with cheap hardware
    - Simplified programming model
18. How do I use it?
    - Download Hadoop
    - Define the cluster in the Hadoop settings
    - Import data using Hadoop
    - Define the job using the API, Pig, or streaming
    - Run the job
    - Output is saved to file(s)
    - Sign up for the Hadoop mailing list
19. Resources
    - Hadoop project site
    - Yahoo! Hadoop tutorial
    - Hadoop Word Count (PDF)
    - Owen O'Malley's intro to Hadoop
    - Ruby Word Count example
    - Tutorial on Hadoop + EC2 + S3
    - Tutorial on single-node Hadoop
20. Thank you!
    - [email_address]
    - Twitter: erikeldridge
    - Presentation is available here: slideshare.net/erikeldridge