Hadoop high-level intro - U. of Mich. Hack U '09
This is a very high-level introduction to Hadoop delivered to the Information Retrieval class at University of Michigan during the Hack U week '09.

Presentation Transcript

  • Hadoop: A (very) high-level overview University of Michigan Hack U ’09 Erik Eldridge Yahoo! Developer Network Photo credit: Swami Stream (http://ow.ly/17tC)
  • Overview
    • What is it?
    • Example 1: word count
    • Example 2: search suggestions
    • Why would I use it?
    • How do I use it?
    • Some Code
  • Before I continue…
    • Slides are available here: slideshare.net/erikeldridge
  • Hadoop is
    • Software for breaking a big job into smaller tasks, performing each task, and collecting the results
  • Example 1: Counting Words
    • Split into 3 sentences
    • Count words in each sentence
      • 1 “Mary”, 1 “had”, 1 “a”, …
      • 1 “Its”, 1 “fleece”, 1 “was”, …
      • 1 “Everywhere”, 1 “that”, 1 “Mary”, …
    • Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …
    “Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
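The split/count/collect idea above can be sketched on a single machine. This is a hypothetical helper class (not part of Hadoop, and the class name is illustrative): tokenize the text as the per-sentence counting would, then merge tallies into one map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// A minimal, single-machine sketch of the word-count example:
// the tokenizing loop plays the "count words in each sentence" role,
// and the shared map plays the "collect results" role.
public class LocalWordCount {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            Integer prev = counts.get(word);
            counts.put(word, prev == null ? 1 : prev + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(
            "Mary had a little lamb Its fleece was white as snow "
            + "Everywhere that Mary went the lamb was sure to go");
        System.out.println(counts.get("Mary")); // 2
        System.out.println(counts.get("lamb")); // 2
    }
}
```

Hadoop does the same thing, but spreads the counting loop across many machines and handles the merge for you.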
  • Example 2: Search Suggestions
  • Creating search suggestions
    • Gazillions of search queries in server log files
    • How many times was each word used?
    • Using Hadoop, we would:
      • Split up files
      • Count words in each
      • Sum word counts
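The three steps above can be sketched in plain Java: tally words per chunk of log data (the split-and-count side), then sum the partial tallies (the summing side). Class and method names here are illustrative, not Hadoop APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-machine sketch of "split up files / count words in
// each / sum word counts". In Hadoop, tally() would run on many machines
// in parallel and merge() would be done by the framework's reduce phase.
public class SuggestionCounts {
    // Count words in one chunk of log data.
    public static Map<String, Integer> tally(String chunk) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : chunk.split("\\s+")) {
            if (word.isEmpty()) continue;
            Integer prev = counts.get(word);
            counts.put(word, prev == null ? 1 : prev + 1);
        }
        return counts;
    }

    // Sum the per-chunk tallies into one total per word.
    @SafeVarargs
    public static Map<String, Integer> merge(Map<String, Integer>... partials) {
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                Integer prev = total.get(e.getKey());
                total.put(e.getKey(),
                    prev == null ? e.getValue() : prev + e.getValue());
            }
        }
        return total;
    }
}
```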
  • So, Hadoop is
    • A distributed batch processing infrastructure
    • Built to process "web-scale" data: terabytes, petabytes
    • Two components:
      • HDFS
      • MapReduce infrastructure
  • HDFS
    • A distributed, fault-tolerant file system
    • It’s easier to move calculations than data
    • Hadoop will split the data for you
  • MapReduce Infrastructure
    • Two steps:
      • Map
      • Reduce
    • Java, C, C++ APIs
    • Pig, Streaming
  • Java Word Count: Mapper
    // credit: http://ow.ly/1bER
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }
  • Java Word Count: Reducer
    // credit: http://ow.ly/1bER
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
  • Java Word Count: Running it
    // credit: http://ow.ly/1bER
    public class WordCount {
      …
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));
        JobClient.runJob(conf);
      }
      …
    }
  • Streaming Word Count
    // credit: http://ow.ly/1bER
    bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir \
        -mapper streamingMapper.sh -reducer streamingReducer.sh

    streamingMapper.sh:  /bin/sed -e 's| |\n|g' | /bin/grep .
    streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 " " $1}'
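The same pipeline can be tried locally without Hadoop. One detail to add: streaming sorts the mapper output by key before the reducer runs, so a plain `sort` stands in for that shuffle step here.

```shell
# Simulate the streaming job on one machine. sed splits the line into one
# word per line, grep drops empties, sort plays the role of Hadoop's
# shuffle (grouping identical words together), and uniq/awk do the count.
echo "mary had a little lamb mary" \
  | sed -e 's| |\n|g' | grep . \
  | sort \
  | uniq -c | awk '{print $2 " " $1}'
# prints one "word count" line per word, e.g. "mary 2"
```

Without the `sort`, `uniq -c` would only merge adjacent duplicates, which is exactly why the reducer side of a streaming job can rely on such simple tools.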
  • Pig Word Count
    -- credit: http://ow.ly/1bER
    input = LOAD 'in-dir' USING TextLoader();
    words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
    grouped = GROUP words BY $0;
    counts = FOREACH grouped GENERATE group, COUNT(words);
    STORE counts INTO 'out-dir';
  • Beyond Word Count
    • Yahoo! Search
      • Generating their Web Map
    • Zattoo
      • Computing viewership stats
    • New York Times
      • Converting their archives to PDF
    • Last.fm
      • Improving their streams by learning from track skipping patterns
    • Facebook
      • Indexing mail accounts
  • Why use Hadoop?
    • Do you have a very large data set?
    • Hadoop works with cheap hardware
    • Simplified programming model
  • How do I use it?
    • Download Hadoop
    • Define cluster in Hadoop settings
    • Import data using Hadoop
    • Define job using API, Pig, or streaming
    • Run job
    • Output is saved to file(s)
    • Sign up for Hadoop mailing list
  • Resources
    • Hadoop project site
    • Yahoo! Hadoop tutorial
    • Hadoop Word Count ( pdf )
    • Owen O’Malley’s intro to Hadoop
    • Ruby Word Count example
    • Tutorial on Hadoop + EC2 + S3
    • Tutorial on single-node Hadoop
  • Thank you!
    • [email_address]
    • Twitter: erikeldridge
    • Presentation is available here: slideshare.net/erikeldridge