Hadoop high-level intro - U. of Mich. Hack U '09

This is a very high-level introduction to Hadoop, delivered to the Information Retrieval class at the University of Michigan during Hack U week '09.

Published in: Technology, Education
  • Transcript

    • 1. Hadoop: A (very) high-level overview University of Michigan Hack U ’08 Erik Eldridge Yahoo! Developer Network Photo credit: Swami Stream (http://ow.ly/17tC)
    • 2. Overview
        • What is it?
        • Example 1: word count
        • Example 2: search suggestions
        • Why would I use it?
        • How do I use it?
        • Some code
    • 3. Before I continue…
        • Slides are available here: slideshare.net/erikeldridge
    • 4. Hadoop is
        • Software for breaking a big job into smaller tasks, performing each task, and collecting the results
    • 5. Example 1: Counting Words
        “Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
        • Split into 3 sentences
        • Count words in each sentence
            • 1 “Mary”, 1 “had”, 1 “a”, …
            • 1 “Its”, 1 “fleece”, 1 “was”, …
            • 1 “Everywhere”, 1 “that”, 1 “Mary”, …
        • Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …
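The split, count, and collect steps from this slide can be sketched in plain Java, with no Hadoop involved. This is only an illustration under stated assumptions: the class name `RhymeWordCount` and its methods are hypothetical, sentences are split on periods, and punctuation is stripped from each word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: not Hadoop code, just the three steps from the slide.
class RhymeWordCount {

    // "Count words in each sentence": one small task per sentence.
    static Map<String, Integer> countWords(String sentence) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : sentence.split("\\s+")) {
            String word = token.replaceAll("[^\\p{L}]", ""); // keep letters only
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Collect results": add the per-sentence counts together.
    static Map<String, Integer> collect(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // "Split into sentences", then count each one, then collect.
    static Map<String, Integer> run(String text) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String sentence : text.split("\\.")) {
            partials.add(countWords(sentence));
        }
        return collect(partials);
    }

    public static void main(String[] args) {
        String rhyme = "Mary had a little lamb. Its fleece was white as snow. "
                + "Everywhere that Mary went the lamb was sure to go.";
        System.out.println(run(rhyme));
    }
}
```

Because each sentence is counted independently, the per-sentence tasks could run on different machines; only the final collection step needs to see all the partial results.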
    • 6. Example 2: Search Suggestions
    • 7. Creating search suggestions
        • Gazillions of search queries in server log files
        • How many times was each word used?
        • Using Hadoop, we would:
            • Split up files
            • Count words in each
            • Sum word counts
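As a rough single-machine analogy for the three steps above, the sketch below counts words per "log file" in parallel and then sums the per-file counts. Everything here is an assumption made for illustration: the class `QueryLogCount`, the in-memory lists standing in for log files, and the sample queries. Hadoop would spread the same work across many machines rather than across threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: each inner list stands in for one server log file.
class QueryLogCount {

    // Count words in one file; every element is one logged search query.
    static Map<String, Long> countFile(List<String> queries) {
        return queries.stream()
                .flatMap(q -> Arrays.stream(q.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    // Sum the per-file word counts into one table.
    static Map<String, Long> sum(List<Map<String, Long>> perFile) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> counts : perFile) {
            counts.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }

    // The files are independent, so they can be counted in parallel.
    static Map<String, Long> run(List<List<String>> logFiles) {
        List<Map<String, Long>> perFile = logFiles.parallelStream()
                .map(QueryLogCount::countFile)
                .collect(Collectors.toList());
        return sum(perFile);
    }

    public static void main(String[] args) {
        List<List<String>> logFiles = Arrays.asList(
                Arrays.asList("hadoop tutorial", "hadoop word count"),
                Arrays.asList("word count example", "pig tutorial"));
        System.out.println(run(logFiles));
    }
}
```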
    • 8. So, Hadoop is
        • A distributed batch processing infrastructure
        • Built to process "web-scale" data: terabytes, petabytes
        • Two components:
            • HDFS
            • MapReduce infrastructure
    • 9. HDFS
        • A distributed, fault-tolerant file system
        • It’s easier to move calculations than data
        • Hadoop will split the data for you
    • 10. MapReduce Infrastructure
        • Two steps:
            • Map
            • Reduce
        • Java, C, C++ APIs
        • Pig, Streaming
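To make the two steps concrete, here is a minimal in-memory sketch of the map and reduce contract, again in plain Java rather than the Hadoop API. The class `MiniMapReduce` and its method names are hypothetical, and the explicit `shuffle` method stands in for the grouping by key that the framework performs between the two steps.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the map -> group by key -> reduce flow.
class MiniMapReduce {

    // Map: turn one input record into zero or more (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Grouping between the steps: collect all values that share a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: combine all values for one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("a b a");
        lines.add("b c");
        System.out.println(run(lines));
    }
}
```

The point of the split is that `map` only ever sees one record and `reduce` only ever sees one key, so both can be run on many machines without the tasks coordinating with each other.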
    • 11. Java Word Count: Mapper
        // credit: http://ow.ly/1bER
        public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          // Reused for every word so no new objects are allocated per token
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          // Called once per input line; emits (word, 1) for each token
          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              output.collect(word, one);
            }
          }
        }
    • 12. Java Word Count: Reducer
        // credit: http://ow.ly/1bER
        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
          // Called once per word with all the counts emitted for it
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
    • 13. Java Word Count: Running it
        // credit: http://ow.ly/1bER
        import java.io.IOException;
        import java.util.*;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;

        public class WordCount {
          ……
          public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            // the keys are words (strings)
            conf.setOutputKeyClass(Text.class);
            // the values are counts (ints)
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(MapClass.class);
            conf.setReducerClass(Reduce.class);
            // Later Hadoop releases replace these two calls with
            // FileInputFormat.setInputPaths(conf, ...) and FileOutputFormat.setOutputPath(conf, ...)
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            JobClient.runJob(conf);
          ……
    • 14. Streaming Word Count
        // credit: http://ow.ly/1bER
        bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh
        streamingMapper.sh: /bin/sed -e 's| |\n|g' | /bin/grep .
        streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 " " $1}'
    • 15. Pig Word Count
        // credit: http://ow.ly/1bER
        input = LOAD 'in-dir' USING TextLoader();
        words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
        grouped = GROUP words BY $0;
        counts = FOREACH grouped GENERATE group, COUNT(words);
        STORE counts INTO 'out-dir';
    • 16. Beyond Word Count
        • Yahoo! Search
            • Generating their Web Map
        • Zattoo
            • Computing viewership stats
        • New York Times
            • Converting their archives to PDF
        • Last.fm
            • Improving their streams by learning from track skipping patterns
        • Facebook
            • Indexing mail accounts
    • 17. Why use Hadoop?
        • Do you have a very large data set?
        • Hadoop works with cheap hardware
        • Simplified programming model
    • 18. How do I use it?
        • Download Hadoop
        • Define cluster in Hadoop settings
        • Import data using Hadoop
        • Define job using API, Pig, or streaming
        • Run job
        • Output is saved to file(s)
        • Sign up for Hadoop mailing list
    • 19. Resources
        • Hadoop project site
        • Yahoo! Hadoop tutorial
        • Hadoop Word Count (PDF)
        • Owen O’Malley’s intro to Hadoop
        • Ruby Word Count example
        • Tutorial on Hadoop + EC2 + S3
        • Tutorial on single-node Hadoop
    • 20. Thank you!
        • [email_address]
        • Twitter: erikeldridge
        • Presentation is available here: slideshare.net/erikeldridge
