Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 2
                  September 1, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
      Course design and slides derived from
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of
• Chuck Lam’s Hadoop In Action (2011)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Roots in Functional Programming




[Figure: Map applies a function f independently to each input element; Fold aggregates the results with a function g]
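
To make the connection concrete, here is a minimal sketch (an illustration, not from the original slides) of map and fold over an in-memory list using Java 8 streams: map applies a function f to every element independently, and fold (reduce) combines the results with a function g and an initial value.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class MapFoldSketch {
        public static void main(String[] args) {
            List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);

            // map: apply f (here, squaring) to each element independently
            List<Integer> mapped = input.stream()
                                        .map(x -> x * x)
                                        .collect(Collectors.toList());

            // fold: combine the mapped values with g (here, addition), starting from 0
            int folded = mapped.stream().reduce(0, (a, b) -> a + b);

            System.out.println(mapped + " -> " + folded);   // [1, 4, 9, 16, 25] -> 55
        }
    }

MapReduce generalizes this pair: map runs in parallel over records, and the grouped reduce plays the role of fold.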
Divide and Conquer

[Figure: a large unit of “Work” is partitioned into pieces w1, w2, w3, each handled by a “worker” producing partial results r1, r2, r3, which are combined into the final “Result”]
MapReduce
“Big Ideas”
    Scale “out”, not “up”
        Limits of SMP and large shared-memory machines
    Move processing to the data
        Clusters have limited bandwidth
    Process data sequentially, avoid random access
        Seeks are expensive, disk throughput is reasonable
    Seamless scalability
        From the mythical man-month to the tradable machine-hour
Typical Large-Data Problem
          Iterate over a large number of records
          Compute something of interest from each
          Shuffle and sort intermediate results
          Aggregate intermediate results
          Generate final output


             Key idea: provide a functional abstraction for two of these
             operations: per-record computation (map) and aggregation (reduce)




(Dean and Ghemawat, OSDI 2004)
MapReduce Data Flow




Courtesy of Chuck Lam’s Hadoop In Action
(2011), pp. 45, 52
MapReduce “Runtime”
   Handles scheduling
       Assigns workers to map and reduce tasks
   Handles “data distribution”
       Moves processes to data
   Handles synchronization
       Gathers, sorts, and shuffles intermediate data
   Handles errors and faults
        Detects worker failures and restarts them
   Built on a distributed file system
MapReduce
Programmers specify two functions
   map ( K1, V1 ) → list ( K2, V2 )
   reduce ( K2, list(V2) ) → list ( K3, V3)
Note the correspondence of types: map output → reduce input


Data Flow
      Input → “input splits”: each a sequence of logical (K1,V1) “records”
      Map
        • Each split is processed by a single map node
        • map invoked iteratively: once per record in the split
        • For each record processed, map may emit 0-N (K2,V2) pairs

      Reduce
        • reduce invoked iteratively for each ( K2, list(V2) ) intermediate value
        • For each such call, reduce may emit 0-N (K3,V3) pairs
      Each reducer’s output written to a persistent file in HDFS
[Figure: an InputFormat splits each input file into InputSplits; a RecordReader turns each split into (key, value) records fed to a Mapper, which produces the intermediate data. Source: redrawn from a slide by Cloudera, cc-licensed]
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30

Data Flow




     Input → “input splits”: each a sequence of logical (K1,V1) “records”
     For each split, for each record, do map(K1,V1)       (multiple calls)
     Each map call may emit any number of (K2,V2) pairs             (0-N)
Run-time
     Groups all values with the same key into ( K2, list(V2) )
     Determines which reducer will process each key
     Copies data across network as needed for reducer
     Ensures intra-node sort of keys processed by each reducer
       • No guarantee by default of inter-node total sort across reducers
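
As a small illustration (not from the slides), the grouping step can be pictured as an in-memory group-by-key over the mappers' output, sorted by key, with each resulting (key, list-of-values) handed to one reduce call:

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class ShuffleSketch {
        public static void main(String[] args) {
            // (K2, V2) pairs as emitted by several map calls
            List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
                new SimpleEntry<>("a", 1), new SimpleEntry<>("b", 2),
                new SimpleEntry<>("c", 3), new SimpleEntry<>("c", 6),
                new SimpleEntry<>("a", 5), new SimpleEntry<>("c", 2));

            // group values by key; TreeMap keeps keys sorted, as within a reducer
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> e : mapOutput) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
            System.out.println(grouped);   // {a=[1, 5], b=[2], c=[3, 6, 2]}
        }
    }

The real shuffle does this across machines and disks, but the contract seen by reduce is the same.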
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
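
The pseudocode above maps directly onto Hadoop. Below is a minimal runnable sketch against the new API (org.apache.hadoop.mapreduce); the class names (TokenMapper, SumReducer) and the whitespace tokenizer are illustrative choices, not prescribed by the slides.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // map (K1=byte offset, V1=line of text) → list (K2=word, V2=1)
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    context.write(word, ONE);        // Emit(w, 1)
                }
            }
        }

        // reduce (K2=word, list(V2=counts)) → list (K3=word, V3=sum)
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));    // Emit(term, sum)
            }
        }
    }

A driver that wires these classes into a job appears later under “Anatomy of a Job”.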
[Figure: word-count data flow: four mappers emit pairs such as (a,1) (b,2), (c,3) (c,6), (a,5) (c,2), (b,7) (c,8); shuffle and sort groups values by key into a→[1,5], b→[2,7], c→[2,3,6,8]; three reducers then produce the final (key, sum) outputs. Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52]
Partition
   Given:     map ( K1, V1 ) → list ( K2, V2 )
             reduce ( K2, list(V2) ) → list ( K3, V3)

partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]

       Each distinct key (with associated values) sent to a single reducer
         • Same reduce node may process multiple keys in separate reduce() calls

       Balances workload across reducers: roughly equal number of keys to each
         • Default: simple hash of the key, e.g., hash(k’) mod N (# reducers); see sketch below

       Customizable
         • Some keys require more computation than others
             • e.g. value skew, or key-specific computation performed
             • For skew, sampling can dynamically estimate distribution & set partition
         • Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
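
As a concrete sketch (illustrative, not from the slides), a custom partition function in the new API is a Partitioner subclass overriding getPartition; the default behavior is essentially a hash of the key modulo the number of reducers:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // mask off the sign bit so the result is a valid reducer index in [0, N)
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

A skew-aware or key-specific partitioner overrides the same method with its own routing logic.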
Secondary Sorting (Lin 57, White 241)
    How to output sorted bigrams (1st word, then list of 2nds)?
        What if we use word1 as the key and word2 as the value?
        What if we use <first>--<second> as the key?
    Pattern
        Create a composite key of (first, second)
        Define a Key Comparator based on both words
          • This will produce the sort order we want (aa ab ac ba bb bc ca cb…)
        Define a partition function based only on first word (sketched below)
          • All bigrams with the same first word go to same reducer
          • How do you know when the first word changes across invocations?
        Preserve state in the reducer across invocations
          • Will be called separately for each bigram, but we want to remember
            the current first word across bigrams seen
        Hadoop also provides Group Comparator
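
A sketch of the partitioning half of this pattern (illustrative, assuming the composite key is simply a Text of the form first<TAB>second; a fuller implementation would define a custom WritableComparable plus key and group comparators):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstWordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // route on the first word only, so all bigrams sharing it reach the
            // same reducer; the key comparator still sorts on both words
            String first = key.toString().split("\t", 2)[0];
            return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }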
Combine
   Given:      map ( K1, V1 ) → list ( K2, V2 )
             reduce ( K2, list(V2) ) → list ( K3, V3)

combine ( K2, list(V2) ) → list ( K2, V2 )

   Optional optimization
       Local aggregation to reduce network traffic
       No guarantee it will be used, nor how many times it will be called
       Semantics of program cannot depend on its use
   Signature: same input as reduce, same output as map
       Combine may be run repeatedly on its own output
       Lin: Associative & Commutative ⇒ combiner = reducer
         • See next slide
Functional Properties
    Associative: f( a, f(b,c) ) = f( f(a,b), c )
        Grouping of operations doesn’t matter
        YES: Addition, multiplication, concatenation
        NO: division, subtraction, NAND
        NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )
    Commutative: f(a,b) = f(b,a)
        Ordering of arguments doesn’t matter
        YES: addition, multiplication, NAND
        NO: division, subtraction, concatenation
        Concatenate(“a”,“b”) != Concatenate(“b”,“a”)
    Distributive
        White (p. 32) and Lam (p. 84) mention with regard to combiners
        But really, go with associative + commutative in Lin (pp. 20, 27)
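
A tiny self-contained check (an illustration, not from the slides) of why an associative and commutative reduce, such as word count's summation, can safely double as a combiner: pre-aggregating any subset of the values, any number of times, leaves the final result unchanged.

    import java.util.Arrays;
    import java.util.List;

    public class CombinerSafetyDemo {
        // the word-count reduce: sum a list of counts
        static int reduce(List<Integer> values) {
            int sum = 0;
            for (int v : values) sum += v;
            return sum;
        }

        public static void main(String[] args) {
            List<Integer> values = Arrays.asList(3, 6, 2, 8);    // counts for one key

            int direct = reduce(values);                         // no combiner
            int combined = reduce(Arrays.asList(
                reduce(values.subList(0, 2)),                    // combined on mapper 1
                reduce(values.subList(2, 4))));                  // combined on mapper 2

            System.out.println(direct + " == " + combined);      // 19 == 19
        }
    }

Division or subtraction in place of the sum would break this equality, which is why program semantics must never depend on the combiner running.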
[Figure: word-count data flow with combiners: each mapper's output is locally combined before partitioning, e.g. (c,3) and (c,6) become (c,9) on one mapper, so less intermediate data is shuffled; partition then routes each key to its reducer, and shuffle and sort groups values by key as before]
[Figure: MapReduce execution overview: (1) the user program submits a job to the master, which (2) schedules map and reduce tasks onto workers; map workers (3) read their input splits and (4) write intermediate files to local disk; reduce workers (5) remote-read that intermediate data and (6) write the final output files. Adapted from (Dean and Ghemawat, OSDI 2004)]
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178


    Shuffle and 2 Sorts




   As map emits values, local sorting
    runs in tandem (1st sort)
   Combine is optionally called
    0..N times for local aggregation
    on sorted (K2, list(V2)) tuples (more sorting of output)
   Partition determines which (logical) reducer Rj each key will go to
   Node’s TaskTracker tells JobTracker it has keys for Rj
   JobTracker determines node to run Rj based on data locality
   When local map/combine/sort finishes, sends data to Rj’s node
   Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
   For each (K, list(V)) tuple in merged output, call reduce(…)
Distributed File System
    Don’t move data… move computation to the data!
        Store data on the local disks of nodes in the cluster
        Start up the workers on the node that has the data local
    Why?
        Not enough RAM to hold all the data in memory
        Disk access is slow, but disk throughput is reasonable
    A distributed file system is the answer
        GFS (Google File System) for Google’s MapReduce
        HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
           Commodity hardware over “exotic” hardware
                 Scale “out”, not “up”
           High component failure rates
                 Inexpensive commodity components fail all the time
           “Modest” number of huge files
                 Multi-gigabyte files are common, if not encouraged
           Files are write-once, mostly appended to
                 Perhaps concurrently
           Large streaming reads over random access
                 High sustained throughput over low latency




GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
   Files stored as chunks
        Fixed size (64MB)
   Reliability through replication
        Each chunk replicated across 3+ chunkservers
   Single master to coordinate access, keep metadata
        Simple centralized management
   No data caching
        Little benefit due to large datasets, streaming reads
   Simplify the API
        Push some of the issues onto the client (e.g., data layout)

    HDFS = GFS clone (same basic ideas)
Basic Cluster Components
   1 “Manager” node (can be split onto 2 nodes)
       Namenode (NN)
       Jobtracker (JT)
   1-N “Worker” nodes
       Tasktracker (TT)
       Datanode (DN)
   Optional Secondary Namenode
       Periodic backups of Namenode in case of failure
Hadoop Architecture




   Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25
Namenode Responsibilities
   Managing the file system namespace:
       Holds file/directory structure, metadata, file-to-block mapping,
        access permissions, etc.
   Coordinating file operations:
       Directs clients to datanodes for reads and writes
       No data is moved through the namenode
   Maintaining overall health:
       Periodic communication with the datanodes
       Block re-replication and rebalancing
       Garbage collection
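
From a client's point of view this looks as follows; a minimal sketch (not from the slides) using the HDFS Java API, with an illustrative path. The namenode supplies only metadata and block locations; the file's bytes stream directly from the datanodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml, etc.
            FileSystem fs = FileSystem.get(conf);          // HDFS if so configured
            Path path = new Path("/data/sample.txt");      // hypothetical path
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }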
Putting everything together…


[Figure: the namenode daemon runs on the namenode and the jobtracker on the job submission node; each slave node runs a tasktracker and a datanode daemon on top of its local Linux file system]
Anatomy of a Job
   MapReduce program in Hadoop = Hadoop job
       Jobs are divided into map and reduce tasks (+ more!)
       An instance of running a task is called a task attempt
       Multiple jobs can be composed into a workflow
   Job submission process
       Client (i.e., driver program) creates a job, configures it, and
        submits it to job tracker
       JobClient computes input splits (on client end)
       Job data (jar, configuration XML) are sent to JobTracker
       JobTracker puts job data in shared location, enqueues tasks
       TaskTrackers poll for tasks
       Off to the races…
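
For concreteness, a driver sketch for the WordCount classes shown earlier, using the new-API Job class (an old-API driver would use JobConf and JobClient instead); input and output paths come from the command line and are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenMapper.class);
            job.setCombinerClass(WordCount.SumReducer.class);  // safe: sum is associative & commutative
            job.setReducerClass(WordCount.SumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g., an HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }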
Why have 1 API when you can have 2?
White pp. 25-27, Lam pp. 77-80
   Hadoop 0.19 and earlier had “old API”
   Hadoop 0.21 and forward has “new API”
   Hadoop 0.20 has both!
       Old API most stable, but deprecated
       Current books use old API predominantly, but discuss changes
         • Example code using new API available online from publisher
       Some old API classes/methods not yet ported to new API
       Cloud9 uses both, and you can too
Old API
   Mapper (interface)
       void map(K1 key, V1 value, OutputCollector<K2, V2> output,
        Reporter reporter)
       void configure(JobConf job)
       void close() throws IOException
   Reducer/Combiner
       void reduce(K2 key, Iterator<V2> values,
        OutputCollector<K3,V3> output, Reporter reporter)
       void configure(JobConf job)
       void close() throws IOException
   Partitioner
        int getPartition(K2 key, V2 value, int numPartitions)
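
For comparison with the new-API example shown earlier, here is the same word-count map function written against the old API (illustrative; MapReduceBase supplies no-op configure() and close()):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class OldApiTokenMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                output.collect(word, ONE);   // old API: OutputCollector, not Context
            }
        }
    }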
New API
   org.apache.hadoop.mapred now deprecated; instead use
    org.apache.hadoop.mapreduce &
    org.apache.hadoop.mapreduce.lib
   Mapper, Reducer now abstract classes, not interfaces
   Use Context instead of OutputCollector and Reporter
       Context.write(), not OutputCollector.collect()
   Reduce takes value list as Iterable, not Iterator
        Can use Java’s for-each syntax for iterating
   Can throw InterruptedException as well as IOException
   JobConf & JobClient replaced by Configuration & Job
