Ordered Record Collection
Chris Douglas, Yahoo!

Every presenter must include a slide like this one, and protocol demands that it contain no fewer than 5 inaccuracies.

Ordered Record Collection Presentation Transcript

  • 1. Sort of Vinyl: Ordered Record Collection
    Chris Douglas
    01.18.2010
  • 2.-3. Obligatory MapReduce Flow Slide
    [Diagram: input splits (Split 0, 1, 2) read from hdfs://host:8020/input/data on HDFS → Map 0, 1, 2 → Combine* → Reduce → output written to hdfs://host:8020/output/data on HDFS. Slide 3 highlights the map output collection step.]
  • 4. Overview
    Hadoop (∞, 0.10)
    Hadoop [0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
  • 5. Overview
    Hadoop (∞, 0.10)
    Hadoop [0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
    Cretaceous
    Jurassic
    Triassic
  • 6. Awesome!
  • 7.-15. Problem Description
    map(K1,V1) may call collect(K2,V2) any number of times. Each call computes p0 ← partition(key0, val0), an int, and serializes the record: K2.write(DataOutput) followed by V2.write(DataOutput), each of which bottoms out in write(byte[], int, int) calls, leaving the serialized key0 and val0 as byte[] ranges.
  • 16.-20. Problem Description
    For all calls to collect(K2 keyn, V2 valn):
    • Store the result of partition(K2 keyn, V2 valn)
    • Store the ordered set of write(byte[], int, int) calls for keyn
    • Store the ordered set of write(byte[], int, int) calls for valn
    Challenges:
    • The size of the key/value is unknown a priori
    • Records must be grouped for efficient fetch from the reduce
    • The sort occurs after the records are serialized
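
    A minimal sketch of the collection problem in plain Java. Class and field names (CollectSketch, RecordMeta) are illustrative, not Hadoop's: each record is partitioned, serialized into a shared buffer, and its offsets recorded afterward, since the key/value sizes are only known once serialization has run.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Serialize each (key, value) into one shared byte buffer and remember where
    // each record landed and which partition it belongs to.
    public class CollectSketch {
        static final class RecordMeta {
            final int partition, keyStart, valStart, valEnd;
            RecordMeta(int partition, int keyStart, int valStart, int valEnd) {
                this.partition = partition; this.keyStart = keyStart;
                this.valStart = valStart; this.valEnd = valEnd;
            }
        }

        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(bytes);
        private final List<RecordMeta> meta = new ArrayList<>();
        private final int numPartitions;

        CollectSketch(int numPartitions) { this.numPartitions = numPartitions; }

        // Sizes are unknown a priori, so offsets are captured around the writes
        // rather than reserved up front.
        void collect(String key, String val) throws IOException {
            int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            int keyStart = bytes.size();
            out.writeUTF(key);            // stands in for K2.write(DataOutput)
            int valStart = bytes.size();
            out.writeUTF(val);            // stands in for V2.write(DataOutput)
            meta.add(new RecordMeta(partition, keyStart, valStart, bytes.size()));
        }

        public static void main(String[] args) throws IOException {
            CollectSketch c = new CollectSketch(4);
            c.collect("apple", "1");
            c.collect("banana", "2");
            System.out.println(c.meta.size() + " records, " + c.bytes.size() + " bytes");
        }
    }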
  • Overview
    Hadoop (∞, 0.10)
    Hadoop [0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
    Cretaceous
    Jurassic
    Triassic
  • 21.-24. Hadoop (∞, 0.10)
    collect(K2,V2) computes p0 ← partition(key0, val0) and calls SequenceFile::Writer[p0].append(key0, val0): the key and value are written more or less straight through to the local filesystem (key0.write(localFS), val0.write(localFS)).
    Not necessarily true: the SequenceFile may buffer a configurable amount of data to effect block compression, stream buffering, etc.
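
    A sketch of the per-partition append scheme, with DataOutputStream standing in for SequenceFile.Writer and invented file names; the point is that there is no map-side sort at all, only routing by partition.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // One writer per reduce partition; collect() appends to the writer for the
    // record's partition, roughly SequenceFile::Writer[p0].append(key0, val0).
    public class PerPartitionWriters {
        private final DataOutputStream[] writers;

        PerPartitionWriters(int numReduces, String dir) throws IOException {
            writers = new DataOutputStream[numReduces];
            for (int p = 0; p < numReduces; p++) {
                writers[p] = new DataOutputStream(
                    new FileOutputStream(dir + "/part-" + p + ".out"));
            }
        }

        void collect(String key, String val) throws IOException {
            int p = (key.hashCode() & Integer.MAX_VALUE) % writers.length;
            writers[p].writeUTF(key);   // key0.write(localFS)
            writers[p].writeUTF(val);   // val0.write(localFS)
        }

        void close() throws IOException {
            for (DataOutputStream w : writers) w.close();
        }
    }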


  • 25.-27. Hadoop (∞, 0.10)
    With a combiner defined, records are not appended directly: each is cloned (clone(key0, val0)) and buffered (key0, key1, key2, …) until flush(), which runs the combiner as reduce(keyn, val*) and appends its output with SequenceFile::Writer[p0].append(keyn’, valn’), p0 ← partition(key0, val0).
    The combiner may change the partition and ordering of input records. This is no longer supported.
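
    A rough sketch of the clone/flush combiner path, assuming String keys, Integer values, and a TreeMap in place of the real buffered clones; OldCombinerSketch and its Combiner interface are invented for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // collect() clones each record into an in-memory, key-grouped buffer and
    // flush() runs a reduce-style combiner over each group before "appending".
    public class OldCombinerSketch {
        interface Combiner { int combine(String key, List<Integer> vals); }

        private final Map<String, List<Integer>> buffered = new TreeMap<>();
        private final Combiner combiner;

        OldCombinerSketch(Combiner combiner) { this.combiner = combiner; }

        void collect(String key, int val) {
            // clone(key0, val0): keep a private copy, the framework reuses objects
            buffered.computeIfAbsent(key, k -> new ArrayList<>()).add(val);
        }

        // flush(): reduce(keyn, val*) per buffered key, then append(keyn', valn')
        void flush() {
            for (Map.Entry<String, List<Integer>> e : buffered.entrySet()) {
                int combined = combiner.combine(e.getKey(), e.getValue());
                System.out.println(e.getKey() + " -> " + combined);
            }
            buffered.clear();
        }

        public static void main(String[] args) {
            OldCombinerSketch c = new OldCombinerSketch(
                (k, vs) -> vs.stream().mapToInt(Integer::intValue).sum());
            c.collect("a", 1); c.collect("b", 2); c.collect("a", 3);
            c.flush();   // prints a -> 4 and b -> 2
        }
    }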
  • 28.-30. Hadoop (∞, 0.10)
    [Diagram: each reduce (Reduce 0 … Reduce k) fetches its partition files from the TaskTracker; the reduce then runs the sort/merge → localFS.]

  • 31.-35. Hadoop (∞, 0.10)
    Pro:
    • Complexity of sort/merge encapsulated in SequenceFile, shared between MapTask and ReduceTask
    • Very versatile Combiner semantics (change sort order, partition)
    Con:
    • Copy/sort can take a long time for each reduce (lost opportunity to parallelize the sort)
    • Job cleanup is expensive (e.g. a 7k-reducer job must delete 7k files per map on that TT)
    • The Combiner is expensive to use and its memory usage is difficult to track
    • OOMExceptions from untracked memory in buffers, particularly when using compression (HADOOP-570)
  • Overview
    Hadoop (∞, 0.10)
    Hadoop [0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
    Cretaceous
    Jurassic
    Triassic
  • 36.-41. Hadoop [0.10, 0.17)
    collect(K2,V2) computes p0 ← partition(key0, val0), serializes the record into a shared keyValBuffer with K2.write(DataOutput) and V2.write(DataOutput), and records it with BufferSorter[p0].addKeyValue(recOff, keylen, vallen): keep the offset into the buffer and the lengths of the key and value, in one BufferSorter per partition (0, 1, …, k-1, k).
    Add the memory used by all BufferSorter implementations and the keyValBuffer; if the spill threshold is exceeded, spill the contents to disk with sortAndSpillToDisk().
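
    A sketch of this bookkeeping under assumed names (BufferSorterSketch, RecordIndex, spillThresholdBytes): serialized bytes accumulate in one shared buffer while a per-partition list of (offset, keylen, vallen) entries and a running memory total decide when to spill.

    import java.util.ArrayList;
    import java.util.List;

    // Track, per partition, where each serialized record starts and how long its
    // key and value are; spill once the accounted memory crosses a threshold.
    public class BufferSorterSketch {
        static final class RecordIndex {
            final int recOff, keyLen, valLen;
            RecordIndex(int recOff, int keyLen, int valLen) {
                this.recOff = recOff; this.keyLen = keyLen; this.valLen = valLen;
            }
        }

        private final List<List<RecordIndex>> perPartition = new ArrayList<>();
        private long bufferedBytes = 0;        // stand-in for keyValBuffer usage
        private final long spillThresholdBytes;

        BufferSorterSketch(int partitions, long spillThresholdBytes) {
            for (int p = 0; p < partitions; p++) perPartition.add(new ArrayList<>());
            this.spillThresholdBytes = spillThresholdBytes;
        }

        // Called after the key and value have been serialized at recOff.
        void addKeyValue(int partition, int recOff, int keyLen, int valLen) {
            perPartition.get(partition).add(new RecordIndex(recOff, keyLen, valLen));
            bufferedBytes += keyLen + valLen + 12;   // charge for the index entry too
            if (bufferedBytes > spillThresholdBytes) {
                sortAndSpillToDisk();
            }
        }

        private void sortAndSpillToDisk() {
            // Sort each partition's indices by key, write the records out, and
            // reset the accounting (details omitted in this sketch).
            bufferedBytes = 0;
            for (List<RecordIndex> p : perPartition) p.clear();
        }
    }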
  • 42.-44. Hadoop [0.10, 0.17)
    sortAndSpillToDisk(): the sort permutes the offsets into (offset, keylen, vallen). Once ordered, each record is read back via K2.readFields(DataInput) and V2.readFields(DataInput) and output into a SeqFile with SequenceFile::append(K2,V2), and the partition offsets are recorded.
    If defined, the combiner is now run during the spill, separately over each partition. Values emitted from the combiner are written directly to the output partition.
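
    A sketch of the index-permuting sort, assuming a raw lexicographic key comparison; compareRawKeys, sortedOrder, and keyValBuffer are illustrative names. Only the index entries move; the serialized record bytes stay where they were written.

    import java.util.Arrays;

    // Order the records of one partition by comparing serialized key bytes
    // directly in the shared buffer, returning the permutation to spill.
    public class SpillSortSketch {
        static int compareRawKeys(byte[] buf, int off1, int len1, int off2, int len2) {
            int n = Math.min(len1, len2);
            for (int i = 0; i < n; i++) {
                int a = buf[off1 + i] & 0xff, b = buf[off2 + i] & 0xff;
                if (a != b) return a - b;
            }
            return len1 - len2;
        }

        // recOff[i], keyLen[i] describe record i of the partition.
        static Integer[] sortedOrder(byte[] keyValBuffer, int[] recOff, int[] keyLen) {
            Integer[] order = new Integer[recOff.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (x, y) -> compareRawKeys(
                    keyValBuffer, recOff[x], keyLen[x], recOff[y], keyLen[y]));
            return order;
        }
    }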
  • 45.-46. Hadoop [0.10, 0.17)
    Each sortAndSpillToDisk() produces another numbered spill file (0, 1, …), each holding all partitions 0 … k in order.
  • 47.-48. Hadoop [0.10, 0.17)
    mergeParts(): the spill files (each holding partitions 0 … k) are merged into a single map output file, again laid out by partition 0 … k.
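
    A sketch of the merge step for a single partition, using a priority queue over already-sorted spill segments; Strings stand in for serialized records and the class name MergeSketch is invented.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    // k sorted spill segments for one partition merge into one sorted run.
    public class MergeSketch {
        static List<String> merge(List<List<String>> sortedSpills) {
            // Each heap entry tracks one spill's iterator and its current head.
            final class Segment {
                final Iterator<String> it; String head;
                Segment(Iterator<String> it) { this.it = it; head = it.next(); }
            }
            PriorityQueue<Segment> heap =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
            for (List<String> spill : sortedSpills)
                if (!spill.isEmpty()) heap.add(new Segment(spill.iterator()));

            List<String> merged = new ArrayList<>();
            while (!heap.isEmpty()) {
                Segment s = heap.poll();
                merged.add(s.head);
                if (s.it.hasNext()) { s.head = s.it.next(); heap.add(s); }
            }
            return merged;
        }

        public static void main(String[] args) {
            System.out.println(merge(List.of(
                List.of("a", "d"), List.of("b", "c"), List.of("e"))));
            // -> [a, b, c, d, e]
        }
    }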
  • 49.-50. Hadoop [0.10, 0.17)
    [Diagram: each reduce (Reduce 0 … Reduce k) now fetches only its own partition (0, 1, …, k) of the merged map output from the TaskTracker.]
  • 51.-60. Hadoop [0.10, 0.17)
    Pro:
    • Distributes the sort/merge across all maps; a reducer need only merge its inputs
    • Much more predictable memory footprint
    • Shared, in-memory buffer across all partitions w/ efficient sort
    • Combines over each spill, defined by memory usage, instead of record count
    • Running the combiner doesn’t require storing a clone of each record (fewer serializations)
    • In 0.16, the spill was made concurrent with collection (HADOOP-1965)
    Con:
    • Expanding buffers may impose a performance penalty; used memory is calculated on every call to collect(K2,V2)
    • MergeSort copies indices on each level of recursion
    • Deserializing the key/value before appending to the SequenceFile is avoidable
    • The Combiner is weakened by requiring the sort order and partition to remain consistent
    • Though tracked, BufferSorter instances take non-negligible space (HADOOP-1698)
  • Overview
    Hadoop (∞, 0.10)
    Hadoop [0.10, 0.17)
    Hadoop [0.17, 0.22]
    Lucene
    HADOOP-331
    HADOOP-2919
    Cretaceous
    Jurassic
    Triassic
  • 61.-65. Hadoop [0.17, 0.22)
    collect(K2,V2) computes p0 ← partition(key0, val0) and serializes the record with KS.serialize(K2) and VS.serialize(V2) into a single byte array of io.sort.mb bytes (kvbuffer), tracked by the pointers bufstart, bufend, bufindex, and bufmark.
    Instead of explicitly tracking the space used by record metadata, a configurable slice of the buffer, io.sort.mb * io.sort.record.percent, is allocated at the beginning of the task and tracked by kvstart, kvend, and kvindex.
    The partition is no longer implicitly tracked: (partition, keystart, valstart) is stored for every record collected, in the kvoffsets and kvindices arrays.
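
    A sketch of how that split might be computed from io.sort.mb and io.sort.record.percent. The property names are the real knobs; the 16-byte-per-record accounting, constants, and field names here are assumptions for illustration rather than the exact MapTask code.

    // Carve the collection buffer into serialized-data space and record-metadata
    // space up front, from the two configuration values.
    public class SortBufferLayout {
        static final int ACCTSIZE = 3;   // kvindices: partition, keystart, valstart
        // per record: one int in kvoffsets + three ints in kvindices = 16 bytes
        static final int RECSIZE = (ACCTSIZE + 1) * 4;

        final byte[] kvbuffer;           // serialized key/value bytes
        final int[] kvoffsets;           // one entry per record, points into kvindices
        final int[] kvindices;           // (partition, keystart, valstart) triples

        SortBufferLayout(int ioSortMb, float ioSortRecordPercent) {
            long total = (long) ioSortMb << 20;                    // io.sort.mb
            int recordCapacity = (int) (total * ioSortRecordPercent / RECSIZE);
            int dataCapacity = (int) (total - (long) recordCapacity * RECSIZE);
            kvbuffer = new byte[dataCapacity];
            kvoffsets = new int[recordCapacity];
            kvindices = new int[recordCapacity * ACCTSIZE];
        }

        public static void main(String[] args) {
            // io.sort.mb=100, io.sort.record.percent=0.05 were defaults of that era.
            SortBufferLayout b = new SortBufferLayout(100, 0.05f);
            System.out.println(b.kvbuffer.length + " data bytes, "
                + b.kvoffsets.length + " record slots");
        }
    }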
  • 66.-74. Hadoop [0.17, 0.22)
    As records are collected, serialization advances bufindex through kvbuffer; bufmark moves up once a record is fully serialized, and its metadata (p0, keystart, valstart) is written at kvindex. When usage passes io.sort.spill.percent, a spill begins: bufstart/bufend and kvstart/kvend delimit the records and metadata being spilled, while collection continues concurrently into the rest of the circular buffer and bufindex, bufmark, and kvindex wrap around.
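
    A sketch of the soft-limit check behind io.sort.spill.percent, with invented field names; the point is that collection keeps running and a spill is merely requested once either the data bytes or the metadata slots cross the threshold.

    // Decide when the concurrent spill should start, without blocking collect().
    public class SpillTrigger {
        final int bufCapacity, kvCapacity;
        final int softBufLimit, softKvLimit;
        volatile boolean spillInProgress = false;

        SpillTrigger(int bufCapacity, int kvCapacity, float spillPercent) {
            this.bufCapacity = bufCapacity;
            this.kvCapacity = kvCapacity;
            this.softBufLimit = (int) (bufCapacity * spillPercent); // io.sort.spill.percent
            this.softKvLimit = (int) (kvCapacity * spillPercent);
        }

        // bufUsed / kvUsed: bytes and metadata slots currently occupied.
        boolean shouldStartSpill(int bufUsed, int kvUsed) {
            return !spillInProgress
                && (bufUsed >= softBufLimit || kvUsed >= softKvLimit);
        }
    }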
  • 75.-76. Hadoop [0.17, 0.22)
    The RawComparator interface requires that the key be contiguous in the byte[]. A key that would wrap around the end of the buffer is therefore copied to the front; the invalid segment left at the end of the serialization buffer is marked by bufvoid, and bufindex/bufmark continue from the relocated key.
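
    A sketch of that fix-up, assuming headLen bytes of the key were written at the tail of the buffer (starting at bufmark) before the wrap and bufindex bytes after it; this mirrors the idea rather than Hadoop's exact BlockingBuffer code.

    // If a serialized key wrapped around the end of the circular buffer, mark the
    // tail invalid with bufvoid and rebuild the key contiguously at offset 0 so a
    // RawComparator can see it as one byte range.
    public class KeyWrapSketch {
        byte[] kvbuffer;
        int bufvoid;     // end of valid data in the buffer
        int bufmark;     // end of the previous record (start of this key)
        int bufindex;    // current write position, already wrapped past 0

        void makeKeyContiguous(int headLen) {
            byte[] head = new byte[headLen];
            System.arraycopy(kvbuffer, bufmark, head, 0, headLen);
            bufvoid = bufmark;                           // tail segment now invalid
            // shift the post-wrap piece right, then place the head before it
            System.arraycopy(kvbuffer, 0, kvbuffer, headLen, bufindex);
            System.arraycopy(head, 0, kvbuffer, 0, headLen);
            bufindex += headLen;                         // key now starts at 0
        }
    }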
  • 77.-87. Hadoop [0.17, 0.22)
    Pro:
    • Predictable memory footprint; collection (though not spill) is agnostic to the number of reducers. Most memory used for the sort is allocated upfront and maintained for the full task duration.
    • No resizing of buffers, no copying of serialized record data or metadata
    • Uses SequenceFile::appendRaw to avoid a deserialization/serialization pass
    • Effects record compression in-place (removed in 0.18 with improvements to the intermediate data format, HADOOP-2095)
    Other Performance Improvements:
    • Improved performance and no metadata copying using QuickSort (HADOOP-3308)
    • Caching of spill indices (HADOOP-3638)
    • Run the combiner during the merge (HADOOP-3226)
    • Improved locking and synchronization (HADOOP-{5664,3617})
    Con:
    • Complexity; the new code was responsible for several bugs in 0.17 (HADOOP-{3442,3550,3475,3603})
    • io.sort.record.percent is obscure, critical to performance, and awkward
    • While predictable, memory usage is arguably too restricted
    • Really? io.sort.record.percent? (MAPREDUCE-64)
  • 88.-100. Hadoop [0.22]
    A single buffer now holds both serialized records and their metadata. Record data grows forward from the equator (tracked by bufstart, bufend, bufindex, bufmark), while the kvoffsets and kvindices information is interlaced into metadata blocks written backward from the equator (tracked by kvstart, kvend, kvindex). The sort is effected in a manner identical to 0.17, but metadata is allocated per-record rather than a priori. When a spill starts, the equator is moved and collection continues concurrently into the free space, the pointers wrapping around the circular buffer.
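
    A sketch of the interlaced layout under assumed names (EquatorBufferSketch, METASIZE, addMetadata): record bytes would grow forward from the equator while fixed-size metadata blocks grow backward through an int view of the same array. The real logic lives in MapTask's MapOutputBuffer after MAPREDUCE-64 and differs in detail.

    import java.nio.ByteBuffer;
    import java.nio.IntBuffer;

    // One byte[] holds both serialized records (forward from the equator) and
    // 16-byte metadata blocks (backward from the equator).
    public class EquatorBufferSketch {
        static final int METASIZE = 16;      // four ints of metadata per record

        final byte[] kvbuffer;
        final IntBuffer kvmeta;              // int view over the same bytes
        int equator;
        int bufindex;                        // next byte for record data
        int kvindex;                         // next metadata slot (as an int index)

        EquatorBufferSketch(int capacity) {
            kvbuffer = new byte[capacity];
            kvmeta = ByteBuffer.wrap(kvbuffer).asIntBuffer();
            setEquator(0);                   // pos assumed 16-byte aligned here
        }

        void setEquator(int pos) {
            equator = pos;
            bufindex = pos;
            // the first metadata block sits immediately "below" the equator
            kvindex = (pos - METASIZE + kvbuffer.length) % kvbuffer.length / 4;
        }

        // Record one collected record whose bytes were serialized at keystart.
        void addMetadata(int partition, int keystart, int valstart) {
            kvmeta.put(kvindex + 0, partition);
            kvmeta.put(kvindex + 1, keystart);
            kvmeta.put(kvindex + 2, valstart);
            kvmeta.put(kvindex + 3, 0);      // spare accounting slot in this sketch
            // the next block is written METASIZE bytes further backward
            kvindex = (kvindex - METASIZE / 4 + kvmeta.capacity()) % kvmeta.capacity();
        }

        public static void main(String[] args) {
            EquatorBufferSketch b = new EquatorBufferSketch(1 << 20);
            b.addMetadata(0, 0, 5);          // partition 0, key at 0, value at 5
            System.out.println("next metadata int index: " + b.kvindex);
        }
    }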
  • 101. Questions?