Intro To Hadoop

Bill Graham - @billgraham
Data Systems Engineer, Analytics Infrastructure
Info 290 - Analyzing Big Data With Twitter
UC Berkeley Information School
September 2012

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Outline

  • What is Big Data?
  • Hadoop
    • HDFS
    • MapReduce
  • Twitter Analytics and Hadoop




                                   2
What is big data?


  • A bunch of data?
  • An industry?
  • An expertise?
  • A trend?
  • A cliche?




                       3
Wikipedia big data


 In information technology, big data is a loosely-defined
 term used to describe data sets so large and
 complex that they become awkward to work with
 using on-hand database management tools.




    Source: http://en.wikipedia.org/wiki/Big_data      4
How big is big?

  • 2008: Google processes 20 PB a day
  • 2009: Facebook has 2.5 PB user data + 15 TB/day
  • 2009: eBay has 6.5 PB user data + 50 TB/day
  • 2011: Yahoo! has 180-200 PB of data
  • 2012: Facebook ingests 500 TB/day



                                                   5
That’s a lot of data




   Credit: http://www.flickr.com/photos/19779889@N00/1367404058/   6
So what?




           s/data/knowledge/g




                                7
No really, what do you do with it?

  • User behavior analysis
  • AB test analysis
  • Ad targeting
  • Trending topics
  • User and topic modeling
  • Recommendations
  • And more...


                                     8
How to scale data?



                     9
Divide and Conquer
                “Work”
                                      Partition


       w1         w2         w3

     “worker”   “worker”   “worker”


        r1         r2         r3




                “Result”              Combine


                                                  10
Parallel processing is complicated

  • How do we assign tasks to workers?
  • What if we have more tasks than slots?
  • What happens when tasks fail?
  • How do you handle distributed
  synchronization?




  Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/   11
Data storage is not trivial
  • Data volumes are massive
  • Reliably storing PBs of data is challenging
  • Disk/hardware/network failures
  • Probability of failure event increases with number of
  machines

  For example:
   1000 hosts, each with 10 disks
   a disk lasts 3 years
   how many failures per day?
                                                        12
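
A rough back-of-the-envelope answer to the example above, assuming disk failures are spread
evenly over a 3-year lifetime: 1000 hosts × 10 disks = 10,000 disks, and
10,000 ÷ (3 × 365 days) ≈ 9 disk failures per day.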
Hadoop cluster




    Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)



                                                                   13
Hadoop



         14
Hadoop provides



  • Redundant, fault-tolerant data storage
  • Parallel computation framework
  • Job coordination




  http://hadoop.apache.org                  15
Joy




  Credit: http://www.flickr.com/photos/spyndle/3480602438/   16
Hadoop origins

  • Hadoop is an open-source implementation based
  on GFS and MapReduce from Google
  • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
  • Jeffrey Dean and Sanjay Ghemawat. (2004)
  MapReduce: Simplified Data Processing on Large
  Clusters. OSDI 2004



                                                17
Hadoop Stack


        HBase                    Pig                 Hive               Cascading
  (Columnar Database)        (Data Flow)             (SQL)               (Java)


                                  MapReduce
                    (Distributed Programming Framework)


                                     HDFS
                      (Hadoop Distributed File System)




                                                                                        18
HDFS



       19
HDFS is...
  • A distributed file system
  • Redundant storage
  • Designed to reliably store data using commodity
  hardware
  • Designed to expect hardware failures
  • Intended for large files
  • Designed for batch inserts
  • The Hadoop Distributed File System


                                                      20
HDFS - files and blocks
• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files
and blocks
• The SecondaryNameNode (SNN) holds a backup
of the NN data
• DataNodes (DN) store and serve blocks


                                               21
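
To make the pieces above concrete, here is a minimal sketch of a client talking to HDFS
through Hadoop's Java FileSystem API. The path and file contents are hypothetical, and the
NameNode address is assumed to come from the cluster's configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // HDFS client; asks the NN for metadata

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path
        FSDataOutputStream out = fs.create(file);      // NN assigns blocks; data streams to DNs
        out.writeUTF("hello, HDFS");
        out.close();

        FSDataInputStream in = fs.open(file);          // NN returns block locations; bytes come from DNs
        System.out.println(in.readUTF());
        in.close();
    }
}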
Replication


• Multiple copies of a block are stored
• Replication strategy:
   • Copy #1 on another node on same rack
   • Copy #2 on another node on different rack




                                                 22
HDFS - writes
     Client                             Master
                                                    Note: Write path for a
      File                             NameNode     single block shown.
                                                    Client writes multiple
                                                    blocks in parallel.

             block


                             Rack #1                 Rack #2
               Slave node              Slave node   Slave node

               DataNode                DataNode     DataNode



                     Block               Block        Block



                                                                             23
HDFS - reads
     Client                          Master
                                              Client reads multiple blocks
      File                      NameNode      in parallel and re-assembles
                                              into a file.


                           block N
    block 1     block 2



              Slave node        Slave node    Slave node

              DataNode          DataNode       DataNode



                Block                Block       Block



                                                                        24
What about DataNode failures?

• DNs check in with the NN to report health
• Upon failure NN orders DNs to replicate under-replicated blocks




   Credit: http://www.flickr.com/photos/18536761@N00/367661087/   25
MapReduce



            26
MapReduce is...


  • A programming model for expressing distributed
  computations at a massive scale
  • An execution framework for organizing and
  performing such computations
  • An open-source implementation called Hadoop




                                                     27
Typical large-data problem


  • Iterate over a large number of records
  • Extract something of interest from each      (Map)
  • Shuffle and sort intermediate results
  • Aggregate intermediate results               (Reduce)
  • Generate final output




   (Dean and Ghemawat, OSDI 2004)                      28
MapReduce paradigm


  • Implement two functions:
      Map(k1, v1) -> list(k2, v2)
      Reduce(k2, list(v2)) -> list(v3)
  • Framework handles everything else*
  • Values with the same key go to the same reducer




                                             29
MapReduce Flow
               k1 v1   k2 v2   k3 v3      k4 v4   k5 v5    k6 v6




      map                  map                    map                map


     a 1    b 2        c   3   c   6           a 5   c 2           b 7   c 8

           Shuffle and Sort: aggregate values by keys
                  a    1 5               b     2 7           c     2 3 6 8




             reduce                reduce                 reduce


               r1 s1                   r2 s2               r3 s3
                                                                               30
MapReduce - word count example

function map(String name, String document):
  for each word w in document:
    emit(w, 1)

function reduce(String word, Iterator partialCounts):
  totalCount = 0
  for each count in partialCounts:
    totalCount += count
  emit(word, totalCount)




                                                        31
MapReduce paradigm - part 2
  • There’s more!
  • Partitioners decide which key goes to which reducer (see the sketch after this slide)
    • partition(k', numPartitions) -> partNumber
    • Divides the key space into chunks, one per parallel reducer
    • Default is hash-based
  • Combiners can combine Mapper output before
  sending to reducer
    • Reduce(k2, list(v2)) -> list(v3)

                                                    32
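
The sketch below shows roughly what the default hash-based behavior looks like when written
as a custom partitioner against the old org.apache.hadoop.mapred API used elsewhere in these
slides (the class name is hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

    // The same key always hashes to the same partition, so it always reaches the same reducer.
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public void configure(JobConf job) {
        // Nothing to configure in this simple example.
    }
}

It would be wired into a job with conf.setPartitionerClass(WordPartitioner.class).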
MapReduce flow
                     k1 v1   k2 v2   k3 v3     k4 v4     k5 v5      k6 v6




          map                   map                     map                   map


        a 1    b 2           c 3     c 6            a 5     c 2             b 7   c 8

         combine              combine                  combine               combine



        a 1    b 2                 c 9              a 5     c 2             b 7   c 8

         partition             partition               partition             partition

              Shuffle and Sort: aggregate values by keys
                       a     1 5              b     2 7               c     9 2 8




                 reduce                    reduce                  reduce


                     r1 s1                  r2 s2                   r3 s3


                                                                                         33
MapReduce additional details

  • Reduce starts after all mappers complete
  • Mapper output gets written to disk
  • Intermediate data can be copied sooner
  • Reducer gets keys in sorted order
  • Keys not sorted across reducers
  • Global sort requires 1 reducer or smart
  partitioning


                                               34
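
For example, the simplest way to get a single, globally sorted output is to force one reducer;
with the JobConf API used in the word-count driver later in these slides that is:

conf.setNumReduceTasks(1);   // one reducer sees every key, so its sorted output is a global sort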
MapReduce - jobs and tasks
  • Job: a user-submitted map and reduce
  implementation to apply to a data set
  • Task: a single mapper or reducer task
    • Failed tasks get retried automatically
    • Tasks run local to their data, ideally
  • JobTracker (JT) manages job submission and
  task delegation
  • TaskTrackers (TT) ask for work and execute
  tasks

                                                 35
MapReduce architecture
     Client                   Master

     Job                    JobTracker




              Slave node    Slave node    Slave node

              TaskTracker   TaskTracker   TaskTracker



                 Task          Task          Task



                                                        36
What about failed tasks?


 • Tasks will fail
 • JT will retry failed tasks up to N attempts (see the snippet after this slide)
 • After N failed attempts for a task, the job fails
 • Some tasks are slower than others
 • Speculative execution: the JT starts up multiple copies of the same task
 • The first one to complete wins; the others are killed

  Credit: http://www.flickr.com/photos/phobia/2308371224/   37
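
The retry limit N is configurable per job. With the old JobConf API used elsewhere in these
slides it can be set like this (4 is shown only as an illustration):

conf.setMaxMapAttempts(4);      // how many times a failing map task is retried
conf.setMaxReduceAttempts(4);   // same limit for reduce tasks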
MapReduce data locality


  • Move computation to the data
  • Moving data between nodes has a cost
  • MapReduce tries to schedule tasks on nodes
  with the data
  • When not possible TT has to fetch data from DN




                                                     38
MapReduce - Java API
• Mapper:
  void map(WritableComparable key,
           Writable value,
           OutputCollector output,
           Reporter reporter)

• Reducer:
   void reduce(WritableComparable key,
               Iterator values,
               OutputCollector output,
               Reporter reporter)

                                         39
MapReduce - Java API
• Writable (see the sketch after this slide)
      • Hadoop wrapper interface
      • Text, IntWritable, LongWritable, etc
• WritableComparable
      • Writable classes implement WritableComparable
• OutputCollector
      • Class that collects keys and values
• Reporter
      • Reports progress, updates counters
• InputFormat
      • Reads data and provides InputSplits
      • Examples: TextInputFormat, KeyValueTextInputFormat
• OutputFormat
      • Writes data
      • Examples: TextOutputFormat, SequenceFileOutputFormat
                                                               40
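
As a sketch of what the Writable interface listed above involves, a hypothetical custom value
type might look like this; keys would additionally implement WritableComparable so the
framework can sort them:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical (count, sum) pair used as a value type.
public class CountSumWritable implements Writable {
    private long count;
    private double sum;

    public CountSumWritable() {}   // Hadoop needs a no-arg constructor to deserialize instances

    public void write(DataOutput out) throws IOException {
        out.writeLong(count);      // serialize fields in a fixed order...
        out.writeDouble(sum);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readLong();     // ...and read them back in the same order
        sum = in.readDouble();
    }
}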
MapReduce - Counters are...
• A distributed count of events during a job
• A way to indicate job metrics without logging
• Your friend

• Bad:
System.out.println("Couldn't parse value");
• Good:
reporter.incrCounter(BadParseEnum, 1L);


                                                  41
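
A minimal sketch of how a counter like the one above might be used inside a mapper written
against the same old API as the word-count example (the enum and record format are
hypothetical):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParsingMapper extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Each enum constant becomes one counter, grouped under the enum's class name.
    public static enum ParseErrors { BAD_RECORD }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            // Counts from every task are summed by the framework and shown in the JobTracker UI.
            reporter.incrCounter(ParseErrors.BAD_RECORD, 1L);
            return;
        }
        output.collect(new Text(fields[0]), new IntWritable(1));
    }
}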
MapReduce - word count mapper
public static class Map extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
           word.set(tokenizer.nextToken());
           output.collect(word, one);
        }
    }
}
                                                                  42
MapReduce - word count reducer
public static class Reduce extends MapReduceBase implements
       Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key,
                     Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
             sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
    }
}


                                                              43
MapReduce - word count main
  public static void main(String[] args) throws Exception {
       JobConf conf = new JobConf(WordCount.class);
       conf.setJobName("wordcount");

       conf.setOutputKeyClass(Text.class);
       conf.setOutputValueClass(IntWritable.class);

       conf.setMapperClass(Map.class);
       conf.setCombinerClass(Reduce.class);
       conf.setReducerClass(Reduce.class);

       conf.setInputFormat(TextInputFormat.class);
       conf.setOutputFormat(TextOutputFormat.class);

       FileInputFormat.setInputPaths(conf, new Path(args[0]));
       FileOutputFormat.setOutputPath(conf, new Path(args[1]));

       JobClient.runJob(conf);
  }

                                                                  44
MapReduce - running a job


• To run word count, add files to HDFS and do:

$ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir




                                                45
MapReduce is good for...


  • Embarrassingly parallel algorithms
  • Summing, grouping, filtering, joining
  • Off-line batch jobs on massive data sets
  • Analyzing an entire large dataset




                                               46
MapReduce is ok for...



  • Iterative jobs (e.g., graph algorithms)
     • Each iteration must read/write data to disk
     • IO and latency cost of an iteration is high




                                                     47
MapReduce is not good for...

  • Jobs that need shared state/coordination
    • Tasks are shared-nothing
    • Shared-state requires scalable state store
  • Low-latency jobs
  • Jobs on small datasets
  • Finding individual records



                                                   48
Hadoop combined architecture
                      Master

                    JobTracker
                                           Backup

                    NameNode        SecondaryNameNode




      Slave node    Slave node    Slave node

      TaskTracker   TaskTracker   TaskTracker



      DataNode      DataNode      DataNode



                                                        49
NameNode UI
  • Tool for browsing HDFS




                             50
JobTracker UI
  • Tool to see running/completed/failed jobs




                                                51
Running Hadoop

  • Multiple options
  • On your local machine (standalone or pseudo-distributed)
  • Locally with a virtual machine
  • In the cloud (e.g., Amazon EC2)
  • In your own datacenter



                                                   52
Cloudera VM
• Virtual machine with Hadoop and related technologies
pre-loaded
• Great tool for learning Hadoop
• Eases the pain of downloading/installing
• Pre-loaded with sample data and jobs
• Documented tutorials
• VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
• Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial

                                                        53
Twitter Analytics
  and Hadoop


                    54
Multiple teams use Hadoop


  • Analytics
  • Revenue
  • Personalization & Recommendations
  • Growth




                                        55
Twitter Analytics data flow

                                             Production Hosts
                        Log                                                 Application
                      events                                                Data
                               Scribe
                               Aggregators

       Third Party
                                                                                      Social graph
        Imports                         HDFS                              MySQL/      Tweets
                                                                          Gizzard     User profiles
                                                        Distributed                                              Analysts
                           Staging Hadoop Cluster       Crawler                                                  Engineers
              Crane                                                                                              PMs
                                                                  Crane                                          Sales
                                                                             Crane
                                  Log                                                     Rasvelg
HCatalog                          Mover


                                                                 Oink
           Oink       Main Hadoop DW           HBase                                                 Analytics
                                                                            Vertica
                                                                                                     Web Tools
                                                                Crane

                                                                          Crane
                                                                Crane

                                                       Oink
                                                                             MySQL
                                                                                                                       56
Example: active users

  Production Hosts




                     Main Hadoop DW




       MySQL/                                  Analytics
       Gizzard                        MySQL   Dashboards
                          Vertica




                                                           57
Example: active users
                                                                       Job DAG




                                                                 Log mover
  Production Hosts
                                   Log mover
                              (via staging cluster)

                         ib   e   web_events
                     Scr
                                                      Main Hadoop DW
                 Scr
Example: active users (Job DAG)

  • Production hosts write web_events and sms_events logs via Scribe
  • A log mover copies the logs (via a staging cluster) into the main
  Hadoop DW
  • Oink/Pig jobs cleanse, filter, transform, geo-lookup, union and
  de-duplicate (distinct) the raw events; Oink writes the result to
  user_sessions (a sketch of such a pass follows below)
  • Crane imports user_profiles from MySQL/Gizzard into the DW
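To make the cleansing stage concrete, here is a minimal sketch in the same old-style Java MapReduce API as the word-count example earlier in the deck: it filters malformed records, tags each event with a (stubbed) geo lookup, unions two input log directories and de-duplicates the normalized records. The tab-separated input layout (user_id, timestamp, ip, client), the CleanseEvents class name and the lookupCountry stub are assumptions made for illustration only; Twitter's actual cleansing ran as Oink/Pig jobs, not hand-written MapReduce.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class CleanseEvents {

  // Assumed (hypothetical) raw log layout: user_id <TAB> timestamp <TAB> ip <TAB> client
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
      String[] f = value.toString().split("\t");
      if (f.length < 4 || f[0].isEmpty()) {               // cleanse/filter: drop malformed records
        reporter.incrCounter("Cleanse", "BAD_RECORDS", 1L); // counters, not println
        return;
      }
      String country = lookupCountry(f[2]);                // geo lookup (stubbed below)
      // Using the normalized record itself as the key makes the reduce step a DISTINCT
      output.collect(new Text(f[0] + "\t" + country + "\t" + f[3]), NullWritable.get());
    }

    private String lookupCountry(String ip) {
      return "UNKNOWN";  // stand-in for a real IP-to-geo lookup
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, NullWritable, Text, NullWritable> {
    public void reduce(Text key, Iterator<NullWritable> values,
        OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
      output.collect(key, NullWritable.get());             // emit each distinct record once
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CleanseEvents.class);
    conf.setJobName("cleanse_events");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(NullWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);                   // distinct is combiner-safe
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    // Union of web_events and sms_events: pass both log directories as inputs
    FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}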
  • Rasvelg jobs then join, group and count to build aggregations such as
  active_by_geo, active_by_device and active_by_client (sketched below)
  • Crane exports the active_by_* tables to Vertica and MySQL, which feed
  the analytics dashboards
  • Downstream Oink, Crane and Rasvelg jobs continue the DAG from there
                                                                        57
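The Rasvelg-style aggregations are joins, groups and counts over user_sessions. As a rough illustration of just one of them, here is a hedged sketch of an active_by_geo style job in the same old-style Java API; the session record layout (user_id, country, device per tab-separated line), the ActiveByGeo class name and the paths are assumptions, and the real jobs also join against user_profiles and were not written as raw MapReduce.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class ActiveByGeo {

  // Assumed (hypothetical) user_sessions layout: user_id <TAB> country <TAB> device
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] f = value.toString().split("\t");
      if (f.length < 2) {
        reporter.incrCounter("ActiveByGeo", "BAD_RECORDS", 1L); // counters, not println
        return;
      }
      output.collect(new Text(f[1]), new Text(f[0])); // key = country, value = user_id
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, IntWritable> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      // Count distinct users per country. Holding the ids in memory is fine for a
      // sketch; a production job would pre-aggregate or use a second pass instead.
      Set<String> users = new HashSet<String>();
      while (values.hasNext()) {
        users.add(values.next().toString());
      }
      output.collect(key, new IntWritable(users.size()));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ActiveByGeo.class);
    conf.setJobName("active_by_geo");
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. user_sessions
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. active_by_geo
    JobClient.runJob(conf);
  }
}

Such a job would be launched the same way as the word-count example, e.g.:
$ bin/hadoop jar activebygeo.jar ActiveByGeo user_sessions_dir active_by_geo_dir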
Credits



  • Data-Intensive Information Processing Applications ―
  Session #1, Jimmy Lin, University of Maryland




                                                           58
Questions?

 Bill Graham - @billgraham




                             59


Intro To Hadoop

  • 1. Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School September 2012 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
  • 2. Outline • What is Big Data? • Hadoop • HDFS • MapReduce • Twitter Analytics and Hadoop 2
  • 3. What is big data? • A bunch of data? • An industry? • An expertise? • A trend? • A cliche? 3
  • 4. Wikipedia big data In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. Source: http://en.wikipedia.org/wiki/Big_data 4
  • 5. How big is big? 5
  • 6. How big is big? • 2008: Google processes 20 PB a day 5
  • 7. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day 5
  • 8. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day • 2009: eBay has 6.5 PB user data + 50 TB/day 5
  • 9. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data 5
  • 10. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data • 2012: Facebook ingests 500 TB/day 5
  • 11. That’s a lot of data Credit: http://www.flickr.com/photos/19779889@N00/1367404058/ 6
  • 12. So what? 7
  • 13. So what? s/data/knowledge/g 7
  • 14. No really, what do you do with it? 8
  • 15. No really, what do you do with it? • User behavior analysis 8
  • 16. No really, what do you do with it? • User behavior analysis • AB test analysis 8
  • 17. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting 8
  • 18. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting • Trending topics 8
  • 19. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting • Trending topics • User and topic modeling 8
  • 20. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting • Trending topics • User and topic modeling • Recommendations 8
  • 21. No really, what do you do with it? • User behavior analysis • AB test analysis • Ad targeting • Trending topics • User and topic modeling • Recommendations • And more... 8
  • 22. How to scale data? 9
  • 23. Divide and Conquer “Work” Partition w1 w2 w3 “worker” “worker” “worker” r1 r2 r3 “Result” Combine 10
  • 24. Parallel processing is complicated Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 25. Parallel processing is complicated • How do we assign tasks to workers? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 26. Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 27. Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? • What happens when tasks fail? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 28. Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? • What happens when tasks fail? • How do you handle distributed synchronization? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ 11
  • 29. Data storage is not trivial • Data volumes are massive • Reliably storing PBs of data is challenging • Disk/hardware/network failures • Probability of failure event increases with number of machines For example: 1000 hosts, each with 10 disks a disk lasts 3 year how many failures per day? 12
  • 30. Hadoop cluster Cluster of machine running Hadoop at Yahoo! (credit: Yahoo!) 13
  • 31. Hadoop 14
  • 32. Hadoop provides • Redundant, fault-tolerant data storage • Parallel computation framework • Job coordination http://hapdoop.apache.org 15
  • 33. Joy Credit: http://www.flickr.com/photos/spyndle/3480602438/ 16
  • 35. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google 17
  • 36. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung. (2003) The Google File System 17
  • 37. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung. (2003) The Google File System • Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004 17
  • 38. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung. (2003) The Google File System • Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004 17
  • 39. Hadoop Stack (Columnar Database) Pig Hive Cascading (Data Flow) (SQL) (Java) HBase MapReduce (Distributed Programming Framework) HDFS (Hadoop Distributed File System) 18
  • 40. HDFS 19
  • 42. HDFS is... • A distributed file system 20
  • 43. HDFS is... • A distributed file system • Redundant storage 20
  • 44. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware 20
  • 45. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures 20
  • 46. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files 20
  • 47. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts 20
  • 48. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts • The Hadoop Distributed File System 20
  • 49. HDFS - files and blocks 21
  • 50. HDFS - files and blocks • Files are stored as a collection of blocks 21
  • 51. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) 21
  • 52. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) 21
  • 53. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks 21
  • 54. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks • The SecondaryNameNode (SNN) holds a backup of the NN data 21
  • 55. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks • The SecondaryNameNode (SNN) holds a backup of the NN data • DataNodes (DN) store and serve blocks 21
  • 56. Replication • Multiple copies of a block are stored • Replication strategy: • Copy #1 on another node on same rack • Copy #2 on another node on different rack 22
  • 57. HDFS - writes Client Master Note: Write path for a File NameNode single block shown. Client writes multiple blocks in parallel. block Rack #1 Rack #2 Slave node Slave node Slave node DataNode DataNode DataNode Block Block Block 23
  • 58. HDFS - reads Client Master Client reads multiple blocks File NameNode in parallel and re-assembles into a file. block N block 1 block 2 Slave node Slave node Slave node DataNode DataNode DataNode Block Block Block 24
  • 59. What about DataNode failures? • DNs check in with the NN to report health • Upon failure NN orders DNs to replicate under- replicated blocks Credit: http://www.flickr.com/photos/18536761@N00/367661087/ 25
  • 60. MapReduce 26
  • 62. MapReduce is... • A programming model for expressing distributed computations at a massive scale 27
  • 63. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations 27
  • 64. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • An open-source implementation called Hadoop 27
  • 65. Typical large-data problem (Dean and Ghemawat, OSDI 2004) 28
  • 66. Typical large-data problem • Iterate over a large number of records (Dean and Ghemawat, OSDI 2004) 28
  • 67. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each (Dean and Ghemawat, OSDI 2004) 28
  • 68. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results (Dean and Ghemawat, OSDI 2004) 28
  • 69. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results (Dean and Ghemawat, OSDI 2004) 28
  • 70. Typical large-data problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output (Dean and Ghemawat, OSDI 2004) 28
  • 71. Typical large-data problem • Iterate over a large number of records Map • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output (Dean and Ghemawat, OSDI 2004) 28
  • 72. Typical large-data problem • Iterate over a large number of records Map • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results Reduce • Generate final output (Dean and Ghemawat, OSDI 2004) 28
  • 74. MapReduce paradigm • Implement two functions: 29
  • 75. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) 29
  • 76. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3) 29
  • 77. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3) • Framework handles everything else* 29
  • 78. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3) • Framework handles everything else* • Value with same key go to same reducer 29
  • 79. MapReduce Flow k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 map map map map a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c 2 3 6 8 reduce reduce reduce r1 s1 r2 s2 r3 s3 30
  • 80. MapReduce - word count example function map(String name, String document): for each word w in document: emit(w, 1) function reduce(String word, Iterator partialCounts): totalCount = 0 for each count in partialCounts: totalCount += count emit(word, totalCount) 31
  • 81. MapReduce paradigm - part 2 32
  • 82. MapReduce paradigm - part 2 • There’s more! 32
  • 83. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer 32
  • 84. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber 32
  • 85. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber • Divides key space into parallel reducers chunks 32
  • 86. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber • Divides key space into parallel reducers chunks • Default is hash-based 32
  • 87. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber • Divides key space into parallel reducers chunks • Default is hash-based • Combiners can combine Mapper output before sending to reducer 32
  • 88. MapReduce paradigm - part 2 • There’s more! • Partioners decide what key goes to what reducer • partition(k’, numPartitions) -> partNumber • Divides key space into parallel reducers chunks • Default is hash-based • Combiners can combine Mapper output before sending to reducer • Reduce(k2, list(v2)) -> list(v3) 32
  • 89. MapReduce flow k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 map map map map a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 combine combine combine combine a 1 b 2 c 9 a 5 c 2 b 7 c 8 partition partition partition partition Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c 2 9 8 8 3 6 reduce reduce reduce r1 s1 r2 s2 r3 s3 33
  • 91. MapReduce additional details • Reduce starts after all mappers complete 34
  • 92. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk 34
  • 93. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner 34
  • 94. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner • Reducer gets keys in sorted order 34
  • 95. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner • Reducer gets keys in sorted order • Keys not sorted across reducers 34
  • 96. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner • Reducer gets keys in sorted order • Keys not sorted across reducers • Global sort requires 1 reducer or smart partitioning 34
  • 97. MapReduce - jobs and tasks 35
  • 98. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set 35
  • 99. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task 35
  • 100. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically 35
  • 101. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally 35
  • 102. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally • JobTracker (JT) manages job submission and task delegation 35
  • 103. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally • JobTracker (JT) manages job submission and task delegation • TaskTrackers (TT) ask for work and execute tasks 35
  • 104. MapReduce architecture Client Master Job JobTracker Slave node Slave node Slave node TaskTracker TaskTracker TaskTracker Task Task Task 36
  • 105. What about failed tasks? Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 106. What about failed tasks? • Tasks will fail Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 107. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 108. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts • After N failed attempts for a task, job fails Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 109. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts • After N failed attempts for a task, job fails • Some tasks are slower than other Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 110. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts • After N failed attempts for a task, job fails • Some tasks are slower than other • Speculative execution is JT starting up multiple of the same task Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 111. What about failed tasks? • Tasks will fail • JT will retry failed tasks up to N attempts • After N failed attempts for a task, job fails • Some tasks are slower than other • Speculative execution is JT starting up multiple of the same task • First one to complete wins, other is killed Credit: http://www.flickr.com/photos/phobia/2308371224/ 37
  • 113. MapReduce data locality • Move computation to the data 38
  • 114. MapReduce data locality • Move computation to the data • Moving data between nodes has a cost 38
  • 115. MapReduce data locality • Move computation to the data • Moving data between nodes has a cost • MapReduce tries to schedule tasks on nodes with the data 38
  • 116. MapReduce data locality • Move computation to the data • Moving data between nodes has a cost • MapReduce tries to schedule tasks on nodes with the data • When not possible TT has to fetch data from DN 38
  • 117. MapReduce - Java API • Mapper: void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) • Reducer: void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) 39
  • 118. MapReduce - Java API 40
  • 119. MapReduce - Java API • Writable 40
  • 120. MapReduce - Java API • Writable • Hadoop wrapper interface 40
  • 121. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc 40
  • 122. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable 40
  • 123. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable 40
  • 124. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector 40
  • 125. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values 40
  • 126. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter 40
  • 127. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters 40
  • 128. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat 40
  • 129. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits 40
  • 130. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits • Examples: TextInputFormat, KeyValueTextInputFormat 40
  • 131. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits • Examples: TextInputFormat, KeyValueTextInputFormat • OutputFormat 40
  • 132. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits • Examples: TextInputFormat, KeyValueTextInputFormat • OutputFormat • Writes data 40
  • 133. MapReduce - Java API • Writable • Hadoop wrapper interface • Text, IntWritable, LongWritable, etc • WritableComparable • Writable classes implement WritableComparable • OutputCollector • Class that collects keys and values • Reporter • Reports progress, updates counters • InputFormat • Reads data and provide InputSplits • Examples: TextInputFormat, KeyValueTextInputFormat • OutputFormat • Writes data • Examples: TextOutputFormat, SequenceFileOutputFormat 40
  • 134. MapReduce - Counters are... 41
  • 135. MapReduce - Counters are... • A distributed count of events during a job 41
  • 136. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging 41
  • 137. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend 41
  • 138. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend 41
  • 139. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: 41
  • 140. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: System.out.println(“Couldn’t parse value”); 41
  • 141. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: System.out.println(“Couldn’t parse value”); • Good: 41
  • 142. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: System.out.println(“Couldn’t parse value”); • Good: reporter.incrCounter(BadParseEnum, 1L); 41
  • 143. MapReduce - word count mapper public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } } 42
  • 144. MapReduce - word count reducer public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } 43
  • 145. MapReduce - word count main public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); } 44
  • 146. MapReduce - running a job • To run word count, add files to HDFS and do: $ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir 45
  • 147. MapReduce is good for... 46
  • 148. MapReduce is good for... • Embarrassingly parallel algorithms 46
  • 149. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining 46
  • 150. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets 46
  • 151. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset 46
  • 152. MapReduce is ok for... 47
• 153. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) 47
• 154. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) • Each iteration must read/write data to disk 47
• 155. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) • Each iteration must read/write data to disk • IO and latency cost of an iteration is high 47
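A rough sketch of why iteration is costly: the driver typically launches one full MapReduce job per iteration, and HDFS is the only channel between iterations (the job class and paths below are hypothetical):

  // Illustrative only: each iteration is a separate job whose input and
  // output must round-trip through HDFS.
  public static void runIterations(int maxIterations) throws IOException {
    Path current = new Path("graph/iter-0");
    for (int i = 1; i <= maxIterations; i++) {
      Path next = new Path("graph/iter-" + i);
      JobConf conf = new JobConf(PageRankStep.class); // hypothetical per-iteration job
      FileInputFormat.setInputPaths(conf, current);
      FileOutputFormat.setOutputPath(conf, next);
      JobClient.runJob(conf);   // blocks until this iteration's job completes
      current = next;           // the next job reads what this one wrote to disk
    }
  }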
  • 156. MapReduce is not good for... 48
  • 157. MapReduce is not good for... • Jobs that need shared state/coordination 48
  • 158. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing 48
  • 159. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store 48
  • 160. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store • Low-latency jobs 48
  • 161. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store • Low-latency jobs • Jobs on small datasets 48
  • 162. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared-state requires scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records 48
• 163. Hadoop combined architecture (diagram): a master node runs the JobTracker and NameNode, a backup node runs the SecondaryNameNode, and each slave node runs a TaskTracker and a DataNode. 49
  • 164. NameNode UI • Tool for browsing HDFS 50
  • 165. JobTracker UI • Tool to see running/completed/failed jobs 51
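With default Hadoop 1.x settings (the ports are configurable), these UIs are typically served at:

  http://<namenode-host>:50070/     # NameNode UI: browse HDFS, DataNode status
  http://<jobtracker-host>:50030/   # JobTracker UI: running/completed/failed jobs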
  • 167. Running Hadoop • Multiple options 52
• 168. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) 52
• 169. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) • Locally with a virtual machine 52
• 170. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) • Locally with a virtual machine • In the cloud (e.g., Amazon EC2) 52
• 171. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) • Locally with a virtual machine • In the cloud (e.g., Amazon EC2) • In your own datacenter 52
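As a minimal sketch of the local route, assuming a Hadoop 1.x tarball already configured for pseudo-distributed mode, the quick start looks roughly like:

  $ bin/hadoop namenode -format                 # one-time: format the NameNode
  $ bin/start-all.sh                            # start HDFS and MapReduce daemons
  $ bin/hadoop fs -put conf input               # stage some sample input in HDFS
  $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
  $ bin/hadoop fs -cat output/*                 # view the results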
  • 172. Cloudera VM 53
  • 173. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded 53
  • 174. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop 53
  • 175. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing 53
  • 176. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs 53
  • 177. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs • Documented tutorials 53
• 178. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs • Documented tutorials • VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4 53
• 179. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs • Documented tutorials • VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4 • Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial 53
  • 180. Twitter Analytics and Hadoop 54
  • 181. Multiple teams use Hadoop • Analytics • Revenue • Personalization & Recommendations • Growth 55
• 183. Twitter Analytics data flow (architecture diagram, built up through slide 188): production hosts send application event logs through Scribe aggregators into HDFS on a staging Hadoop cluster, alongside third-party imports and social graph, Tweets, and user-profile data from MySQL/Gizzard; a distributed crawler and log mover feed the main Hadoop DW, where internal tools Crane, Oink, and Rasvelg move and aggregate data and HCatalog provides table metadata; results reach HBase, Vertica, MySQL, and analytics web tools, serving analysts, engineers, PMs, and sales. 56
• 189. Example: active users (job DAG diagram, built up through slide 195): web_events and sms_events are scribed from production hosts and delivered by the log mover via the staging cluster; an Oink/Pig job cleanses, filters, transforms, performs geo lookup, and unions/dedupes them into user_sessions on the main Hadoop DW; Crane imports user_profiles from MySQL/Gizzard; Rasvelg joins, groups, and counts to produce aggregations such as active_by_geo, active_by_device, and active_by_client; Crane then loads the active_by_* tables into MySQL and Vertica for analytics dashboards. 57
  • 196. Credits • Data-Intensive Information Processing Applications ― Session #1, Jimmy Lin, University of Maryland 58
  • 197. Questions? Bill Graham - @billgraham 59
