Hadoop high-level intro - U. of Mich. Hack U '09

This is a very high-level introduction to Hadoop delivered to the Information Retrieval class at University of Michigan during the Hack U week '09.

Transcript

  • 1. Hadoop: A (very) high-level overview
    • University of Michigan Hack U ’08
    • Erik Eldridge, Yahoo! Developer Network
    • Photo credit: Swami Stream (http://ow.ly/17tC)
  • 2. Overview
    • What is it?
    • Example 1: word count
    • Example 2: search suggestions
    • Why would I use it?
    • How do I use it?
    • Some Code
  • 3. Before I continue…
    • Slides are available here: slideshare.net/erikeldridge
  • 4. Hadoop is
    • Software for breaking a big job into smaller tasks, performing each task, and collecting the results
  • 5. Example 1: Counting Words
    • Text: “Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
    • Split into 3 sentences
    • Count words in each sentence
      • 1 “Mary”, 1 “had”, 1 “a”, …
      • 1 “Its”, 1 “fleece”, 1 “was”, …
      • 1 “Everywhere”, 1 “that”, 1 “Mary”, …
    • Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …
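
The three steps on this slide (split, count each piece, collect) can be sketched in plain Java with no Hadoop involved. This is only an illustration of the idea: the class name `SentenceWordCount` and its methods are hypothetical, and splitting on periods is a simplifying assumption about what counts as a sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SentenceWordCount {
    // Step 1: split the text into sentences on '.', dropping empty pieces.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("\\.")) {
            if (!s.trim().isEmpty()) {
                sentences.add(s.trim());
            }
        }
        return sentences;
    }

    // Step 2: count the words in a single sentence.
    static Map<String, Integer> countWords(String sentence) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : sentence.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Step 3: collect the per-sentence counts into one total per word.
    static Map<String, Integer> collect(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    // Runs all three steps in order.
    static Map<String, Integer> run(String text) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            partials.add(countWords(sentence));
        }
        return collect(partials);
    }

    public static void main(String[] args) {
        String text = "Mary had a little lamb. Its fleece was white as snow. "
                + "Everywhere that Mary went the lamb was sure to go.";
        System.out.println(run(text));
    }
}
```

Each call to `countWords` is independent of the others, which is what would let a system like Hadoop hand the sentences to different machines.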
  • 6. Example 2: Search Suggestions
  • 7. Creating search suggestions
    • Gazillions of search queries in server log files
    • How many times was each word used?
    • Using Hadoop, we would:
      • Split up files
      • Count words in each
      • Sum word counts
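
As a rough sketch of the same split/count/sum flow applied to query logs, the plain-Java example below treats each log file as a list of query strings. The class `QueryLogCounts`, its methods, and the sample queries are all made up for illustration, and ranking words by frequency is an assumption about how the totals might feed a suggestion feature, not something the slides spell out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryLogCounts {
    // Count words in one log file (one query per entry).
    static Map<String, Integer> countFile(List<String> queries) {
        Map<String, Integer> counts = new HashMap<>();
        for (String query : queries) {
            for (String word : query.toLowerCase().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Sum the per-file counts into a single table.
    static Map<String, Integer> sum(List<Map<String, Integer>> perFile) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> counts : perFile) {
            counts.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    // Return the k most frequent words; ties are broken alphabetically.
    static List<String> topWords(Map<String, Integer> totals, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> perFile = new ArrayList<>();
        perFile.add(countFile(List.of("hadoop tutorial", "hadoop download")));
        perFile.add(countFile(List.of("pig tutorial", "hadoop streaming")));
        System.out.println(topWords(sum(perFile), 2));
    }
}
```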
  • 8. So, Hadoop is
    • A distributed batch processing infrastructure
    • Built to process "web-scale" data: terabytes, petabytes
    • Two components:
      • HDFS
      • MapReduce infrastructure
  • 9. HDFS
    • A distributed, fault-tolerant file system
    • It’s easier to move calculations than data
    • Hadoop will split the data for you
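
To make "Hadoop will split the data for you" a little more concrete, here is a minimal sketch of cutting a file into fixed-size blocks. It is not the HDFS API: `BlockSplitSketch` is a hypothetical class, the 64 MB block size is an assumed value chosen for the example, and real HDFS also replicates each block across machines, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

class BlockSplitSketch {
    // Returns one {offset, length} pair per block of a file of the given size.
    static List<long[]> split(long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last block may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new long[] {offset, length});
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with an assumed 64 MB block size.
        for (long[] block : split(200 * mb, 64 * mb)) {
            System.out.println("offset=" + block[0] + " length=" + block[1]);
        }
    }
}
```

Because each block can be processed where it is stored, the computation travels to the data rather than the other way around.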
  • 10. MapReduce Infrastructure
    • Two steps:
      • Map
      • Reduce
    • Java, C, C++ APIs
    • Pig, Streaming
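
The two steps can be illustrated with a tiny in-memory imitation of the model, written in plain Java. This is a sketch under stated assumptions, not Hadoop's API: `MiniMapReduce` and its `run` method are hypothetical, everything runs in one process, and the grouping of values by key (which Hadoop performs between the two steps) is done here with a sorted map.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

class MiniMapReduce {
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map step: turn each input into (key, value) pairs,
        // grouping the values by key as they are produced.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapper.apply(input)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce step: collapse each key's list of values into one result.
        Map<K, R> results = new TreeMap<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            results.put(entry.getKey(), reducer.apply(entry.getKey(), entry.getValue()));
        }
        return results;
    }

    // Word count expressed as a mapper and a reducer.
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
                lines,
                line -> {
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String word : line.trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            pairs.add(new SimpleEntry<>(word, 1));
                        }
                    }
                    return pairs;
                },
                (word, ones) -> {
                    int sum = 0;
                    for (int one : ones) {
                        sum += one;
                    }
                    return sum;
                });
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the lamb", "the snow")));
    }
}
```

The mapper and reducer passed to `run` play the same roles as the `MapClass` and `Reduce` classes in the word count slides.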
  • 11. Java Word Count: Mapper
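      //credit: http://ow.ly/1bER
      // Imports belong at the top of the enclosing WordCount.java file
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reporter;
      public static class MapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // Emits (word, 1) for every whitespace-separated token in the input line
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }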
  • 12. Java Word Count: Reducer
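      //credit: http://ow.ly/1bER
      import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sums all the counts emitted for one word and emits (word, total)
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }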
  • 13. Java Word Count: Running it
      //credit: http://ow.ly/1bER
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      public class WordCount {
        ……
        public static void main(String[] args) throws IOException {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          // the keys are words (strings)
          conf.setOutputKeyClass(Text.class);
          // the values are counts (ints)
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(MapClass.class);
          conf.setReducerClass(Reduce.class);
          // args[0] is the input directory, args[1] the output directory.
          // Later Hadoop releases replace these two calls with
          // FileInputFormat.setInputPaths and FileOutputFormat.setOutputPath.
          conf.setInputPath(new Path(args[0]));
          conf.setOutputPath(new Path(args[1]));
          JobClient.runJob(conf);
          ……
  • 14. Streaming Word Count
    • //credit: http://ow.ly/1bER
    • bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh
    • streamingMapper.sh: /bin/sed -e 's| |\n|g' | /bin/grep .
    • streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
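
The streaming reducer works only because the reducer receives its keys in sorted order, so identical words arrive on adjacent lines and `uniq -c` can count each run. The sketch below shows that run-counting logic in plain Java; the class `UniqCountSketch` is hypothetical, and the tab-separated output format is an assumption made to mirror typical streaming output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UniqCountSketch {
    // Counts runs of identical adjacent words, like `uniq -c` on sorted input.
    // Each output line is "word<TAB>count".
    static List<String> countRuns(List<String> sortedWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sortedWords.size()) {
            String word = sortedWords.get(i);
            int count = 0;
            // Consume the whole run of this word.
            while (i < sortedWords.size() && sortedWords.get(i).equals(word)) {
                count++;
                i++;
            }
            out.add(word + "\t" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        // Input is already sorted, as it would be when it reaches the reducer.
        List<String> words = Arrays.asList("lamb", "lamb", "little", "mary", "mary");
        for (String line : countRuns(words)) {
            System.out.println(line);
        }
    }
}
```

If the input were not sorted, the same word could appear in several separate runs and be reported more than once.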
  • 15. Pig Word Count
    • //credit: http://ow.ly/1bER
    • input = LOAD 'in-dir' USING TextLoader();
    • words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
    • grouped = GROUP words BY $0;
    • counts = FOREACH grouped GENERATE group, COUNT(words);
    • STORE counts INTO 'out-dir';
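
For readers more at home in Java than in Pig, the same dataflow can be approximated with the Java streams API: tokenizing and flattening corresponds to `flatMap`, grouping to `groupingBy`, and counting to `counting`. This is an analogy only. `PigStyleWordCount` is a hypothetical class, it runs on an in-memory list rather than a directory of files, and splitting on whitespace is a simplification of what Pig's `TOKENIZE` does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class PigStyleWordCount {
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // FLATTEN(TOKENIZE(*)): one stream element per word
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // GROUP ... BY $0, then COUNT(words) per group
                .collect(Collectors.groupingBy(
                        word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Mary had a little lamb",
                "Everywhere that Mary went");
        System.out.println(count(lines));
    }
}
```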
  • 16. Beyond Word Count
    • Yahoo! Search
      • Generating their Web Map
    • Zattoo
      • Computing viewership stats
    • New York Times
      • Converting their archives to PDF
    • Last.fm
      • Improving their streams by learning from track skipping patterns
    • Facebook
      • Indexing mail accounts
  • 17. Why use Hadoop?
    • Do you have a very large data set?
    • Hadoop works with cheap hardware
    • Simplified programming model
  • 18. How do I use it?
    • Download Hadoop
    • Define cluster in Hadoop settings
    • Import data using Hadoop
    • Define job using API, Pig, or streaming
    • Run job
    • Output is saved to file(s)
    • Sign up for Hadoop mailing list
  • 19. Resources
    • Hadoop project site
    • Yahoo! Hadoop tutorial
    • Hadoop Word Count (PDF)
    • Owen O’Malley’s intro to Hadoop
    • Ruby Word Count example
    • Tutorial on Hadoop + EC2 + S3
    • Tutorial on single-node Hadoop
  • 20. Thank you!
    • [email_address]
    • Twitter: erikeldridge
    • Presentation is available here: slideshare.net/erikeldridge