Ordered Record Collection
Chris Douglas, Yahoo!

  • Every presenter must include a slide like this one, and protocol demands that it contain no fewer than 5 inaccuracies

Ordered Record Collection: Presentation Transcript

  • Sort of Vinyl: Ordered Record Collection
    Chris Douglas
    01.18.2010
  • Obligatory MapReduce Flow Slide
    [Diagram: input splits (Split 0, 1, 2) read from hdfs://host:8020/input/data are each processed by a Map task with an optional Combine* step; map output is shuffled to the Reduce task(s) and the result written to hdfs://host:8020/output/data on HDFS]
  • Obligatory MapReduce Flow Slide
    Map Output Collection
    [Same diagram, with the Map Output Collection stage between map and combine/reduce highlighted]
  • Overview
    Hadoop (∞, 0.10)
    Hadoop [ 0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
  • Overview
    Hadoop (∞, 0.10)
    Hadoop [ 0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
    Cretaceous
    Jurassic
    Triassic
  • Awesome!
  • Problem Description
    map(K1,V1)
    *
    collect(K2,V2)
  • Problem Description
    p0  partition(key0,val0)
    map(K1,V1)
    *
    Serialization
    collect(K2,V2)
    *
    K2.write(DataOutput)
    write(byte[], int, int)
    *
    V2.write(DataOutput)
    write(byte[], int, int)
  • Problem Description
    p0 partition(key0,val0)
    map(K1,V1)
    *
    Serialization
    collect(K2,V2)
    *
    K2.write(DataOutput)
    write(byte[], int, int)
    *
    V2.write(DataOutput)
    write(byte[], int, int)
    key0
  • Problem Description
    p0 partition(key0,val0)
    map(K1,V1)
    *
    Serialization
    collect(K2,V2)
    *
    K2.write(DataOutput)
    write(byte[], int, int)
    *
    V2.write(DataOutput)
    write(byte[], int, int)
    key0
    val0
  • Problem Description
    int
    p0 partition(key0,val0)
    map(K1,V1)
    *
    Serialization
    collect(K2,V2)
    *
    K2.write(DataOutput)
    write(byte[], int, int)
    *
    V2.write(DataOutput)
    write(byte[], int, int)
    key0
    val0
    byte[]
    byte[]
  • Problem Description
    For all calls to collect(K2 keyn, V2 valn):
    • Store result of partition(K2 keyn, V2 valn)
    • Ordered set of write(byte[], int, int) for keyn
    • Ordered set of write(byte[], int, int) for valn
    Challenges:
    • Size of key/value unknown a priori
    • Records must be grouped for efficient fetch from reduce
    • Sort occurs after the records are serialized
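    To make the collection contract above concrete, here is a minimal Java sketch (illustrative only; BufferingCollector and its int[] metadata entries are invented names, not MapTask's real code) of capturing the partition number and the serialized key/value byte ranges for every collect() call:

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;
        import org.apache.hadoop.io.DataOutputBuffer;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Partitioner;

        public class BufferingCollector<K extends Writable, V extends Writable>
            implements OutputCollector<K, V> {

          // One growable byte[] holding every serialized record, plus per-record metadata.
          private final DataOutputBuffer buffer = new DataOutputBuffer();
          private final List<int[]> meta = new ArrayList<int[]>(); // {partition, keyStart, valStart, valEnd}
          private final Partitioner<K, V> partitioner;
          private final int numReduces;

          public BufferingCollector(Partitioner<K, V> partitioner, int numReduces) {
            this.partitioner = partitioner;
            this.numReduces = numReduces;
          }

          public void collect(K key, V value) throws IOException {
            int p = partitioner.getPartition(key, value, numReduces);  // store result of partition()
            int keyStart = buffer.getLength();
            key.write(buffer);                                         // ordered writes for key_n
            int valStart = buffer.getLength();
            value.write(buffer);                                       // ordered writes for val_n
            meta.add(new int[] { p, keyStart, valStart, buffer.getLength() });
            // The challenges above fall out of this: lengths are known only after
            // serialization, the sort must operate on these byte ranges, and records
            // must later be grouped by partition for the reduces to fetch.
          }
        }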
  • Overview
    Hadoop (∞, 0.10)
    Hadoop [ 0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
    Cretaceous
    Jurassic
    Triassic
  • Hadoop (∞, 0.10)
    p0 partition(key0,val0)
    map(K1,V1)
    *
    collect(K2,V2)
    collect(K2,V2)
    SequenceFile::Writer[p0].append(key0, val0)


  • Hadoop (∞, 0.10)
    p0 partition(key0,val0)
    map(K1,V1)
    *
    collect(K2,V2)
    collect(K2,V2)
    key0.write(localFS)
    SequenceFile::Writer[p0].append(key0, val0)
    val0.write(localFS)


  • Hadoop (∞, 0.10)
Not necessarily true. SeqFile may buffer a configurable amount of data to effect block compression, stream buffering, etc.
    p0 partition(key0,val0)
    map(K1,V1)
    *
    collect(K2,V2)
    collect(K2,V2)
    key0.write(localFS)
    SequenceFile::Writer[p0].append(key0, val0)
    val0.write(localFS)
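    A rough sketch of the pre-0.10 scheme shown above, where collect() appends straight to one SequenceFile.Writer per partition on the local filesystem (the class name and the fixed Text/IntWritable types are chosen only for illustration):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class PerPartitionWriters {
          private final SequenceFile.Writer[] writers;

          public PerPartitionWriters(Configuration conf, Path outDir, int numReduces)
              throws IOException {
            FileSystem localFs = FileSystem.getLocal(conf);
            writers = new SequenceFile.Writer[numReduces];
            for (int p = 0; p < numReduces; ++p) {
              // One output file per reduce; key/value classes fixed here for the example.
              writers[p] = SequenceFile.createWriter(localFs, conf,
                  new Path(outDir, "part-" + p), Text.class, IntWritable.class);
            }
          }

          // Equivalent of: SequenceFile::Writer[p0].append(key0, val0)
          public void collect(int partition, Text key, IntWritable val) throws IOException {
            writers[partition].append(key, val);  // may buffer internally (block compression, etc.)
          }
        }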


  • Hadoop (∞, 0.10)
    key0
    key1
    clone(key0, val0)
    map(K1,V1)
    key2
    *
    flush()
    collect(K2,V2)
    collect(K2,V2)
    reduce(keyn, val*)
    SequenceFile::Writer[p0].append(keyn’, valn’)

    p0 partition(key0,val0)

  • Hadoop (∞, 0.10)
    key0
    key1
    clone(key0, val0)
    map(K1,V1)
    key2
    *
    flush()
    collect(K2,V2)
    collect(K2,V2)
    reduce(keyn, val*)
    SequenceFile::Writer[p0].append(keyn’, valn’)

    p0 partition(key0,val0)

    Combiner may change the partition and ordering of input records. This is no longer supported
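    A simplified sketch of the clone-and-flush combining shown above; the Combiner and Emitter interfaces here are stand-ins for the real reduce-style combiner, used only to illustrate why buffering cloned records was expensive and why, at this point, a combiner could still change key order and partition:

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;
        import java.util.TreeMap;

        public class CloneAndCombine<K extends Comparable<K>, V> {

          public interface Combiner<K, V> {
            // Combines all buffered values for one key; it may emit any key (and
            // hence any partition), which is the flexibility later dropped.
            void combine(K key, List<V> values, Emitter<K, V> out);
          }
          public interface Emitter<K, V> {
            void emit(K key, V value);  // e.g. SequenceFile::Writer[p].append(key, value)
          }

          // Cloned records, kept sorted by key until flush().
          private final Map<K, List<V>> buffered = new TreeMap<K, List<V>>();

          public void collect(K keyClone, V valClone) {
            List<V> vals = buffered.get(keyClone);
            if (vals == null) {
              vals = new ArrayList<V>();
              buffered.put(keyClone, vals);
            }
            vals.add(valClone);
          }

          // flush(): run the combiner over each key group, then drop the clones.
          public void flush(Combiner<K, V> combiner, Emitter<K, V> out) {
            for (Map.Entry<K, List<V>> e : buffered.entrySet()) {
              combiner.combine(e.getKey(), e.getValue(), out);
            }
            buffered.clear();
          }
        }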
  • Hadoop (∞, 0.10)
    [Diagram: each reduce (Reduce 0 … Reduce k) fetches its per-partition SequenceFiles from the TaskTracker that ran the map]
  • Hadoop (∞, 0.10)
    [Diagram: each reduce then sort/merges the fetched files on its localFS]

  • Hadoop (∞, 0.10)
    Pro:
    • Complexity of sort/merge encapsulated in SequenceFile, shared between MapTask and ReduceTask
    • Very versatile Combiner semantics (change sort order, partition)
    Con:
    • Copy/sort can take a long time for each reduce (lost opportunity to parallelize sort)
    • Job cleanup is expensive (e.g. 7k reducer job must delete 7k files per map on that TT)
    • Combiner is expensive to use and its memory usage is difficult to track
    • OOMExceptions from untracked memory in buffers, particularly when using compression (HADOOP-570)
  • Overview
    Hadoop (∞, 0.10)
    Hadoop [ 0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
    Cretaceous
    Jurassic
    Triassic
  • Hadoop [0.10, 0.17)
    map(K1,V1)
    p0 partition(key0,val0)
    *
    collect(K2,V2)
    K2.write(DataOutput)
    V2.write(DataOutput)
    BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

    0
    1
    k-1
    k
    sortAndSpillToDisk()
  • Hadoop [0.10, 0.17)
    map(K1,V1)
    p0 partition(key0,val0)
Add memory used by all BufferSorter implementations and keyValBuffer. If the spill threshold is exceeded, spill the contents to disk
    *
    collect(K2,V2)
    K2.write(DataOutput)
    V2.write(DataOutput)
    BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

    0
    1
    k-1
    k
    sortAndSpillToDisk()
    Keep offset into buffer, length of key, value.
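    A sketch of the per-partition accounting described above, assuming an illustrative PartitionRecordIndex in place of the real BufferSorter implementations: each collected record contributes an (offset, keylen, vallen) entry, and the tracked memory feeds the spill-threshold check.

        import java.util.ArrayList;
        import java.util.List;

        public class PartitionRecordIndex {
          // One entry per collected record: where it starts in keyValBuffer and how
          // long its serialized key and value are.
          private final List<long[]> entries = new ArrayList<long[]>();  // {recOff, keyLen, valLen}
          private long bytesUsed;                                        // tracked for the spill check

          public void addKeyValue(long recOff, int keyLen, int valLen) {
            entries.add(new long[] { recOff, keyLen, valLen });
            bytesUsed += keyLen + valLen + 3 * 8;  // record data plus (rough) metadata cost
          }

          public long getMemoryUsed() {
            return bytesUsed;
          }
        }

        // In collect(): sum getMemoryUsed() over all k+1 partitions plus the shared
        // keyValBuffer; if the total crosses the configured spill threshold, call
        // sortAndSpillToDisk().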
  • Hadoop [0.10, 0.17)
    map(K1,V1)
    p0 partition(key0,val0)
    *
    collect(K2,V2)
    K2.write(DataOutput)
    V2.write(DataOutput)
    BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

    *
    0
    1
    k-1
    k
    sortAndSpillToDisk()
    *Sort permutes offsets into (offset,keylen,vallen). Once ordered, each record is output into a SeqFile and the partition offsets recorded
    0
  • Hadoop [0.10, 0.17)
    map(K1,V1)
    p0 partition(key0,val0)
    *
    collect(K2,V2)
    K2.write(DataOutput)
    V2.write(DataOutput)
    BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

    *
    0
    1
    k-1
    k
    sortAndSpillToDisk()
    *Sort permutes offsets into (offset,keylen,vallen). Once ordered, each record is output into a SeqFile and the partition offsets recorded
    0
    K2.readFields(DataInput)
    V2.readFields(DataInput)
    SequenceFile::append(K2,V2)
  • Hadoop [0.10, 0.17)
    map(K1,V1)
    p0 partition(key0,val0)
    *
    collect(K2,V2)
    K2.write(DataOutput)
    V2.write(DataOutput)
    BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

    0
    1
    k-1
    k
    sortAndSpillToDisk()
    *If defined, the combiner is now run during the spill, separately over each partition. Values emitted from the combiner are written directly to the output partition.
    0
    K2.readFields(DataInput)
    V2.readFields(DataInput)
    *
    << Combiner >>
    SequenceFile::append(K2,V2)
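    A sketch of what sortAndSpillToDisk() does for one partition, per the notes above: sort the index entries by comparing the serialized keys in place, then deserialize each record and append it to the partition's SequenceFile (the combiner, if defined, would run over each key group at this point). Names and structure are illustrative, not the original code.

        import java.io.IOException;
        import java.util.Arrays;
        import java.util.Comparator;
        import org.apache.hadoop.io.DataInputBuffer;
        import org.apache.hadoop.io.RawComparator;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Writable;

        public class SpillSketch {
          public static void spillPartition(final byte[] keyValBuffer, long[][] entries,
              final RawComparator<?> comparator, Writable key, Writable val,
              SequenceFile.Writer out) throws IOException {

            // Sort permutes only the {recOff, keyLen, valLen} entries; the serialized
            // records themselves never move.
            Arrays.sort(entries, new Comparator<long[]>() {
              public int compare(long[] a, long[] b) {
                return comparator.compare(keyValBuffer, (int) a[0], (int) a[1],
                                          keyValBuffer, (int) b[0], (int) b[1]);
              }
            });

            // Spill in key order: deserialize and append. This readFields/append pass
            // is the extra serialization work the 0.17 rewrite later avoided.
            DataInputBuffer in = new DataInputBuffer();
            for (long[] e : entries) {
              in.reset(keyValBuffer, (int) e[0], (int) (e[1] + e[2]));
              key.readFields(in);
              val.readFields(in);
              out.append(key, val);
            }
          }
        }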
  • Hadoop [0.10, 0.17)
    map(K1,V1)
    p0 partition(key0,val0)
    *
    collect(K2,V2)
    K2.write(DataOutput)
    V2.write(DataOutput)
    BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

    0
    1
    k-1
    k
    sortAndSpillToDisk()
    [Diagram builds: repeated spills accumulate on disk, each spill file holding a sorted segment for every partition 0 … k]
  • Hadoop [0.10, 0.17)
    mergeParts()
    [Diagram: each spill file holds sorted segments for partitions 0 … k; mergeParts() merges the matching segments from every spill into a single output file with one contiguous, sorted region per partition]
  • Hadoop [0.10, 0.17)
    [Diagram: the TaskTracker serves partition i of the merged map output to Reduce i (Reduce 0 … Reduce k)]
  • Hadoop [0.10, 0.17)
    Pro:
    • Distributes the sort/merge across all maps; reducer need only merge its inputs
    • Much more predictable memory footprint
    • Shared, in-memory buffer across all partitions w/ efficient sort
    • Combines over each spill, defined by memory usage, instead of record count
• Running the combiner doesn’t require storing a clone of each record (fewer serializations)
    • In 0.16, spill was made concurrent with collection (HADOOP-1965)
    Con:
    • Expanding buffers may impose a performance penalty; used memory calculated on every call to collect(K2,V2)
    • MergeSort copies indices on each level of recursion
    • Deserializing the key/value before appending to the SequenceFile is avoidable
    • Combiner weakened by requiring sort order and partition to remain consistent
• Though tracked, BufferSorter instances take non-negligible space (HADOOP-1698)
  • Overview
    Hadoop (∞, 0.10)
    Hadoop [ 0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
    Cretaceous
    Jurassic
    Triassic
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
    io.sort.mb * io.sort.record.percent

    io.sort.mb
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
VS.serialize(V2)
    Instead of explicitly tracking space used by record metadata, allocate a configurable amount of space at the beginning of the task
    io.sort.mb * io.sort.record.percent

    io.sort.mb
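    A sketch of how these two knobs carve up the collection buffer (approximate; the real MapOutputBuffer also aligns and sanity-checks the values):

        import org.apache.hadoop.conf.Configuration;

        public class SortBufferSizing {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            int sortMb = conf.getInt("io.sort.mb", 100);                    // total collection space
            float recPer = conf.getFloat("io.sort.record.percent", 0.05f);  // fraction for record metadata

            int totalBytes = sortMb << 20;
            int metaBytes = (int) (totalBytes * recPer);  // accounting space, allocated up front
            int dataBytes = totalBytes - metaBytes;       // serialization buffer (kvbuffer)

            // Each record costs a fixed amount of accounting space, so
            // io.sort.record.percent effectively fixes how many records fit per
            // spill regardless of record size (the awkwardness MAPREDUCE-64 targets).
            System.out.println("data bytes:     " + dataBytes);
            System.out.println("metadata bytes: " + metaBytes);
          }
        }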
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
    bufstart
    bufend
    bufindex
    bufmark
    io.sort.mb * io.sort.record.percent
    kvstart
    kvend
    kvindex
    io.sort.mb
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
    bufstart
    bufend
    bufindex
    bufmark
    io.sort.mb * io.sort.record.percent
    kvstart
    kvend
    kvindex
    io.sort.mb
    kvoffsets
    kvindices
    Partition no longer implicitly tracked. Store (partition, keystart,valstart) for every record collected
    kvbuffer
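    A sketch of the accounting structures named above, loosely modeled on MapTask's MapOutputBuffer: kvoffsets holds one int per record pointing into kvindices, kvindices holds the (partition, keystart, valstart) triple for that record, and kvbuffer holds the serialized bytes. The exact field ordering within an entry is illustrative.

        public class AccountingSketch {
          static final int PARTITION = 0;  // positions within one kvindices entry (illustrative order)
          static final int KEYSTART  = 1;
          static final int VALSTART  = 2;
          static final int ACCTSIZE  = 3;  // ints of metadata per record

          int[] kvoffsets;  // permuted by the sort
          int[] kvindices;  // ACCTSIZE ints per record, never moved
          byte[] kvbuffer;  // serialized keys and values, never moved
          int kvindex;      // next free metadata slot
          int bufindex;     // next free byte in kvbuffer

          void recordMeta(int partition, int keystart, int valstart) {
            final int base = kvindex * ACCTSIZE;
            kvindices[base + PARTITION] = partition;
            kvindices[base + KEYSTART]  = keystart;
            kvindices[base + VALSTART]  = valstart;
            kvoffsets[kvindex] = base;                    // the sort permutes only these indirections
            kvindex = (kvindex + 1) % kvoffsets.length;   // circular, like the data buffer
          }
        }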
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
    [Diagram builds: as records are collected, bufindex and bufmark advance through the circular serialization buffer and kvindex advances through the accounting buffers, while bufstart/bufend and kvstart/kvend remain at the start of the not-yet-spilled data]
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
    bufstart
    bufend
    kvstart
    kvend
    kvindex
    p0
    bufmark
    bufindex
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
    bufstart
    bufend
    kvstart
    kvend
    kvindex
    io.sort.spill.percent
    bufindex
    bufmark
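    A sketch of the soft-limit check implied above: once either the serialization buffer or the record metadata passes io.sort.spill.percent of its capacity, a spill starts while collection continues into the remaining space (names are illustrative; 0.80 is the default in this era).

        public class SpillTriggerSketch {
          final int softBufferLimit;  // bytes of kvbuffer that may fill before spilling
          final int softRecordLimit;  // records that may accumulate before spilling

          SpillTriggerSketch(int dataBytes, int maxRecords, float spillPercent) {
            // spillPercent comes from io.sort.spill.percent (default 0.80).
            softBufferLimit = (int) (dataBytes * spillPercent);
            softRecordLimit = (int) (maxRecords * spillPercent);
          }

          boolean shouldStartSpill(int bytesCollected, int recordsCollected) {
            return bytesCollected >= softBufferLimit || recordsCollected >= softRecordLimit;
          }
        }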
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
    [Diagram builds: when the spill threshold is reached, bufend and kvend move to mark the region to spill and the spill runs while collect() keeps filling the remaining space; once the spill completes, bufstart and kvstart advance to reclaim the spilled region]
  • Hadoop [0.17, 0.22)
    map(K1,V1)
    p0  partition(key0,val0)
    *
    Serialization
    KS.serialize(K2)
    collect(K2,V2)
    VS.serialize(V2)
    Invalid segments in the serialization buffer are marked by bufvoid
    RawComparator interface requires that the key be contiguous in the byte[]
    bufmark
    bufvoid
    bufindex
    kvindex
    kvstart
    kvend
    bufstart
    bufend
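    A sketch of the bufvoid fix-up described above, simplified from the buffer-reset logic: because RawComparator needs each key as one contiguous (byte[], offset, length) span, a key that wrapped past the end of the circular buffer is copied back to the front and the stranded tail is marked invalid via bufvoid.

        public class KeyWrapSketch {
          byte[] kvbuffer;
          int bufmark;   // end of the last fully collected record
          int bufindex;  // current write position (has wrapped past 0)
          int bufvoid;   // first invalid byte at the tail of the buffer

          // Assumes enough free space remains ahead of bufstart; otherwise a spill
          // must run first.
          void makeKeyContiguous() {
            final int headBytes = kvbuffer.length - bufmark;  // key bytes stranded at the tail
            // Shift the part of the key already written at the front of the buffer...
            System.arraycopy(kvbuffer, 0, kvbuffer, headBytes, bufindex);
            // ...and copy the stranded tail bytes in front of it.
            System.arraycopy(kvbuffer, bufmark, kvbuffer, 0, headBytes);
            bufindex += headBytes;
            bufvoid = bufmark;  // bytes in [bufvoid, kvbuffer.length) are ignored by the sort
          }
        }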
  • Hadoop [0.17, 0.22)
    Pro:
    • Predictable memory footprint, collection (though not spill) agnostic to number of reducers. Most memory used for the sort allocated upfront and maintained for the full task duration.
    • No resizing of buffers, copying of serialized record data or metadata
    • Uses SequenceFile::appendRaw to avoid deserialization/serialization pass
    • Effects record compression in-place (removed in 0.18 with improvements to intermediate data format HADOOP-2095)
    Other Performance Improvements
    • Improved performance, no metadata copying using QuickSort (HADOOP-3308)
    • Caching of spill indices (HADOOP-3638)
    • Run combiner during the merge (HADOOP-3226)
    • Improved locking and synchronization (HADOOP-{5664,3617})
    Con:
    • Complexity and new code responsible for several bugs in 0.17
    • (HADOOP-{3442,3550,3475,3603})
    • io.sort.record.percent is obscure, critical to performance, and awkward
    • While predictable, memory usage is arguably too restricted
    • Really? io.sort.record.percent? (MAPREDUCE-64)
  • Hadoop [0.22]
    bufstart
    bufend
    bufindex
    bufmark
    equator
    kvstart
    kvend
    kvindex
  • Hadoop [0.22]
    [Diagram builds: records are collected as in 0.17, with serialized data and per-record metadata now growing away from the equator on either side of a single buffer]
  • Hadoop [0.22]
    bufstart
    bufend
    equator
    kvstart
    kvend
    kvindex
    bufmark
    bufindex
    p0
    kvoffsets and kvindices information interlaced into metadata blocks. The sort is effected in a manner identical to 0.17, but metadata is allocated per-record, rather than a priori
    (kvoffsets)
    (kvindices)
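    A sketch of the interlaced layout described above, loosely modeled on the 0.22 MapOutputBuffer: one byte[] holds serialized records growing forward from the equator and fixed-size per-record metadata growing backward from it, so accounting space is only spent on records actually collected.

        import java.nio.ByteBuffer;
        import java.nio.IntBuffer;

        public class EquatorSketch {
          static final int NMETA = 4;             // ints of metadata per record
          static final int METASIZE = NMETA * 4;  // 16 bytes: partition, keystart, valstart, vallen

          final byte[] kvbuffer;   // serialized records AND their metadata
          final IntBuffer kvmeta;  // int view of the same bytes, used for the metadata side
          int equator;             // records grow up from here, metadata grows down from here
          int bufindex;            // next byte for serialized record data
          int kvindex;             // next metadata slot (in ints), moving away from the equator

          EquatorSketch(int sortBytes) {
            kvbuffer = new byte[sortBytes];
            kvmeta = ByteBuffer.wrap(kvbuffer).asIntBuffer();
            setEquator(0);
          }

          // Called at start-up and after each spill: both regions restart at pos.
          void setEquator(int pos) {
            equator = pos;
            bufindex = pos;
            int aligned = pos - (pos % METASIZE);
            kvindex = ((aligned - METASIZE + kvbuffer.length) % kvbuffer.length) / 4;
          }
        }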
  • Hadoop [0.22]
    [Diagram builds: as before, reaching the spill threshold marks the region between kvstart/kvend and bufstart/bufend for spilling while collection continues; after each spill the equator is repositioned and serialized data and metadata again grow away from it]
  • Questions?