Ordered Record Collection
Chris Douglas, Yahoo!

  • Speaker note: Every presenter must include a slide like this one, and protocol demands that it contain no fewer than 5 inaccuracies
  • Transcript

    • 1. Sort of Vinyl: Ordered Record Collection
      Chris Douglas
      01.18.2010
    • 2. Obligatory MapReduce Flow Slide
      Split 2
      Map 2
      Combine*
      Reduce 1
      Split 1
      Map 1
      hdfs://host:8020/input/data
      hdfs://host:8020/output/data
      HDFS
      HDFS
      Combine*
      Reduce 1
      Split 0
      Map 0
      Combine*
    • 3. Obligatory MapReduce Flow Slide
      Map Output Collection
      Split 2
      Map 2
      Combine*
      Reduce 1
      Split 1
      Map 1
      hdfs://host:8020/input/data
      hdfs://host:8020/output/data
      HDFS
      HDFS
      Combine*
      Reduce 1
      Split 0
      Map 0
      Combine*
    • 4. Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
    • 5. Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
      Cretaceous
      Jurassic
      Triassic
    • 6. Awesome!
    • 7. Problem Description
      map(K1,V1)
      *
      collect(K2,V2)
    • 8. Problem Description
      p0  partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
    • 9. Problem Description
      p0  partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
    • 10. Problem Description
      p0  partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
    • 11. Problem Description
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
    • 12. Problem Description
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
    • 13. Problem Description
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
      val0
    • 14. Problem Description
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
      val0
    • 15. Problem Description
      int
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
      val0
      byte[]
      byte[]
    • 16. Problem Description
      For all calls to collect(K2 keyn, V2 valn):
      • Store result of partition(K2 keyn, V2 valn)
      • Ordered set of write(byte[], int, int) for keyn
      • Ordered set of write(byte[], int, int) for valn
      Challenges:
      • Size of key/value unknown a priori
      • Records must be grouped for efficient fetch from reduce
      • Sort occurs after the records are serialized
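
      A minimal sketch of the bookkeeping those requirements imply (the class and field names here are invented for illustration, not Hadoop's): because key and value sizes are only known after serialization completes, the collector must record, for each call to collect, the partition plus the byte ranges the two serializers produced.

        // Illustrative only: one metadata entry per collected record.
        class RecordMeta {
          final int partition;              // result of partition(keyN, valN)
          final int keyStart, keyLength;    // span covered by the write(byte[], int, int)
                                            // calls made while serializing keyN
          final int valStart, valLength;    // span covered while serializing valN

          RecordMeta(int partition, int keyStart, int keyLength, int valStart, int valLength) {
            this.partition = partition;
            this.keyStart = keyStart;
            this.keyLength = keyLength;
            this.valStart = valStart;
            this.valLength = valLength;
          }
        }
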
    • Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
      Cretaceous
      Jurassic
      Triassic
    • 21. Hadoop (∞, 0.10)
      p0 partition(key0,val0)
      map(K1,V1)
      *
      collect(K2,V2)
      collect(K2,V2)
      SequenceFile::Writer[p0].append(key0, val0)


    • 22. Hadoop (∞, 0.10)
      p0 partition(key0,val0)
      map(K1,V1)
      *
      collect(K2,V2)
      collect(K2,V2)
      key0.write(localFS)
      SequenceFile::Writer[p0].append(key0, val0)
      val0.write(localFS)


    • 23. Hadoop (∞, 0.10)
      p0 partition(key0,val0)
      map(K1,V1)
      *
      collect(K2,V2)
      collect(K2,V2)
      key0.write(localFS)
      SequenceFile::Writer[p0].append(key0, val0)
      val0.write(localFS)


    • 24. Hadoop (∞, 0.10)
      Not necessarily true. SeqFile may buffer a configurable amount of data to effect block compression, stream buffering, etc.
      p0 partition(key0,val0)
      map(K1,V1)
      *
      collect(K2,V2)
      collect(K2,V2)
      key0.write(localFS)
      SequenceFile::Writer[p0].append(key0, val0)
      val0.write(localFS)
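
      A hedged sketch of the scheme in the slides above, with invented class and file names (the real logic lived inside MapTask): one SequenceFile.Writer is opened per reduce on the local filesystem, and collect() appends each record directly to the writer chosen by the partition function.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        // Hypothetical, simplified collector illustrating the pre-0.10 flow.
        class DirectCollector {
          private final SequenceFile.Writer[] out;
          private final int numReduces;

          DirectCollector(Configuration conf, Path localDir, int numReduces) throws IOException {
            this.numReduces = numReduces;
            FileSystem localFs = FileSystem.getLocal(conf);
            out = new SequenceFile.Writer[numReduces];
            for (int p = 0; p < numReduces; ++p) {
              // One open file per reduce partition for the lifetime of the map task.
              out[p] = SequenceFile.createWriter(localFs, conf,
                  new Path(localDir, "part-" + p), Text.class, IntWritable.class);
            }
          }

          void collect(Text key, IntWritable val) throws IOException {
            int p = (key.hashCode() & Integer.MAX_VALUE) % numReduces;  // stand-in partitioner
            out[p].append(key, val);  // key.write()/val.write() stream to the local FS
          }
        }

      Note that no sorting happens on the map side at all here; as the later slides show, the sort/merge is deferred entirely to the reduce.
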


    • 25. Hadoop (∞, 0.10)
      key0
      key1
      clone(key0, val0)
      map(K1,V1)
      key2
      *
      flush()
      collect(K2,V2)
      collect(K2,V2)
      reduce(keyn, val*)
      SequenceFile::Writer[p0].append(keyn’, valn’)

      p0 partition(key0,val0)

    • 26. Hadoop (∞, 0.10)
      key0
      key1
      clone(key0, val0)
      map(K1,V1)
      key2
      *
      flush()
      collect(K2,V2)
      collect(K2,V2)
      reduce(keyn, val*)
      SequenceFile::Writer[p0].append(keyn’, valn’)

      p0 partition(key0,val0)

    • 27. Hadoop (∞, 0.10)
      key0
      key1
      clone(key0, val0)
      map(K1,V1)
      key2
      *
      flush()
      collect(K2,V2)
      collect(K2,V2)
      reduce(keyn, val*)
      SequenceFile::Writer[p0].append(keyn’, valn’)

      p0 partition(key0,val0)

      Combiner may change the partition and ordering of input records. This is no longer supported
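
      A rough sketch of the clone-and-flush combining pictured above, using invented names and a plain TreeMap in place of the historical structure: records are cloned as they are collected, and flush() plays the role of reduce(keyn, val*) before the combined output is appended to the per-partition writers.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;
        import java.util.SortedMap;
        import java.util.TreeMap;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.io.WritableComparable;
        import org.apache.hadoop.io.WritableUtils;

        // Hypothetical illustration of the buffered-combine scheme, not the historical code.
        class CombiningBuffer<K extends WritableComparable<K>, V extends Writable> {
          interface Combine<K, V> { void reduce(K key, List<V> vals); }

          private final SortedMap<K, List<V>> records = new TreeMap<K, List<V>>();
          private final Configuration conf;

          CombiningBuffer(Configuration conf) { this.conf = conf; }

          void collect(K key, V val) {
            // map() reuses its key/value objects, so each collected record is cloned.
            List<V> vals = records.get(key);
            if (vals == null) {
              vals = new ArrayList<V>();
              records.put(WritableUtils.clone(key, conf), vals);
            }
            vals.add(WritableUtils.clone(val, conf));
          }

          void flush(Combine<K, V> combiner) {
            // reduce(keyn, val*): the combiner's output would then be appended to
            // SequenceFile::Writer[p], just as an uncombined record would be.
            for (Map.Entry<K, List<V>> e : records.entrySet()) {
              combiner.reduce(e.getKey(), e.getValue());
            }
            records.clear();
          }
        }

      The cons listed a few slides later follow from this shape: every record is cloned and buffered, and the combiner may legally change partition and ordering, which later versions stop supporting.
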
    • 28. Hadoop (∞, 0.10)
      Reduce k
      Reduce 0

      TaskTracker

    • 29. Hadoop (∞, 0.10)
      Reduce k
      Reduce 0

      TaskTracker

    • 30. Hadoop (∞, 0.10)
      Reduce 0
      sort/merge  localFS

    • 31. Hadoop (∞, 0.10)
      Pro:
      • Complexity of sort/merge encapsulated in SequenceFile, shared between MapTask and ReduceTask
      • Very versatile Combiner semantics (change sort order, partition)
      Con:
      • Copy/sort can take a long time for each reduce (lost opportunity to parallelize sort)
      • Job cleanup is expensive (e.g. a 7k-reducer job must delete 7k files per map on that TT)
      • Combiner is expensive to use and its memory usage is difficult to track
      • OOMExceptions from untracked memory in buffers, particularly when using compression (HADOOP-570)
    • Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
      Cretaceous
      Jurassic
      Triassic
    • 36. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 37. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 38. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 39. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 40. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 41. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      Add memory used by all BufferSorter implementations and keyValBuffer. If spill threshold exceeded, then spill contents to disk
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
      Keep offset into buffer, length of key, value.
    • 42. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      *
      0
      1
      k-1
      k
      sortAndSpillToDisk()
      *Sort permutes offsets into (offset,keylen,vallen). Once ordered, each record is output into a SeqFile and the partition offsets recorded
      0
    • 43. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      *
      0
      1
      k-1
      k
      sortAndSpillToDisk()
      *Sort permutes offsets into (offset,keylen,vallen). Once ordered, each record is output into a SeqFile and the partition offsets recorded
      0
      K2.readFields(DataInput)
      V2.readFields(DataInput)
      SequenceFile::append(K2,V2)
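
      The sort never moves the serialized bytes; it permutes the per-record (offset, keylen, vallen) entries, comparing the keys where they lie in the buffer. A simplified sketch, under the assumption that the keys compare correctly as raw bytes (as WritableComparator.compareBytes does for types whose serialization preserves byte order):

        import java.util.Arrays;
        import java.util.Comparator;
        import org.apache.hadoop.io.WritableComparator;

        // Illustrative only: order the metadata entries of one partition without
        // copying anything out of the shared key/value buffer.
        class OffsetSort {
          static void sort(final byte[] keyValBuffer, Integer[] recIndex,
                           final int[] recOff, final int[] keyLen) {
            Arrays.sort(recIndex, new Comparator<Integer>() {
              @Override
              public int compare(Integer a, Integer b) {
                // Compare serialized keys in place; only recIndex is permuted.
                return WritableComparator.compareBytes(
                    keyValBuffer, recOff[a], keyLen[a],
                    keyValBuffer, recOff[b], keyLen[b]);
              }
            });
            // The spill then walks recIndex in order, deserializing each record and
            // appending it to the partition's SequenceFile as the slide shows.
          }
        }
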
    • 44. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
      *If defined, the combiner is now run during the spill, separately over each partition. Values emitted from the combiner are written directly to the output partition.
      0
      K2.readFields(DataInput)
      V2.readFields(DataInput)
      *
      << Combiner >>
      SequenceFile::append(K2,V2)
    • 45. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      *
      0
      1
      k-1
      k
      sortAndSpillToDisk()
      0
      1
    • 46. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
      0
      1


      k
    • 47. Hadoop [0.10, 0.17)
      mergeParts()
      0
      0
      0
      1
      1
      1






      k
      k
      k
    • 48. Hadoop [0.10, 0.17)
      mergeParts()
      0
      0
      0
      0
      1
      1
      1






      k
      k
      k
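
      mergeParts() produces, for each of the k+1 partitions, a single sorted run out of the runs written by the individual spills. A generic k-way merge sketch, with an invented Segment reader standing in for Hadoop's SequenceFile.Sorter machinery:

        import java.io.IOException;
        import java.util.Comparator;
        import java.util.List;
        import java.util.PriorityQueue;

        // Illustrative k-way merge: each Segment yields records in sorted order,
        // e.g. one Segment per spill file for the partition being merged.
        class SegmentMerger<T extends Comparable<T>> {
          interface Segment<T> {
            T peek();                     // next record, or null when exhausted
            T next() throws IOException;  // consume and return the next record
          }

          void merge(List<Segment<T>> segments, List<T> out) throws IOException {
            PriorityQueue<Segment<T>> heap = new PriorityQueue<Segment<T>>(
                Math.max(1, segments.size()),
                new Comparator<Segment<T>>() {
                  @Override
                  public int compare(Segment<T> a, Segment<T> b) {
                    return a.peek().compareTo(b.peek());
                  }
                });
            for (Segment<T> s : segments) {
              if (s.peek() != null) heap.add(s);
            }
            while (!heap.isEmpty()) {
              Segment<T> s = heap.poll();
              out.add(s.next());                  // emit the smallest head record
              if (s.peek() != null) heap.add(s);  // re-enter the heap at its new head
            }
          }
        }
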
    • 49. Hadoop [0.10, 0.17)
      Reduce 0
      0
      1

      TaskTracker


      k
      Reduce k
    • 50. Hadoop [0.10, 0.17)
      Reduce 0
      0
      1

      TaskTracker


      k
      Reduce k
    • 51. Hadoop [0.10, 0.17)
      Pro:
      • Distributes the sort/merge across all maps; reducer need only merge its inputs
      • Much more predictable memory footprint
      • Shared, in-memory buffer across all partitions w/ efficient sort
      • Combines over each spill, defined by memory usage, instead of record count
      • Running the combiner doesn’t require storing a clone of each record (fewer serializations)
      • In 0.16, spill was made concurrent with collection (HADOOP-1965)
      Con:
      • Expanding buffers may impose a performance penalty; used memory calculated on every call to collect(K2,V2)
      • MergeSort copies indices on each level of recursion
      • Deserializing the key/value before appending to the SequenceFile is avoidable
      • Combiner weakened by requiring sort order and partition to remain consistent
      • Though tracked, BufferSorter instances take non-negligible space (HADOOP-1698)
    • Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
      Cretaceous
      Jurassic
      Triassic
    • 61. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
    • 62. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      io.sort.mb * io.sort.record.percent

      io.sort.mb
    • 63. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      Instead of explicitly tracking space used by record metadata, allocate a configurable amount of space at the beginning of the task
      io.sort.mb * io.sort.record.percent

      io.sort.mb
    • 64. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      bufindex
      bufmark
      io.sort.mb * io.sort.record.percent
      kvstart
      kvend
      kvindex
      io.sort.mb
    • 65. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      bufindex
      bufmark
      io.sort.mb * io.sort.record.percent
      kvstart
      kvend
      kvindex
      io.sort.mb
      kvoffsets
      kvindices
      Partition no longer implicitly tracked. Store (partition, keystart,valstart) for every record collected
      kvbuffer
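
      To make the split concrete, a worked example using the historical defaults (deployments may differ): each collected record costs a fixed 16 bytes of accounting, one int in kvoffsets plus the (partition, keystart, valstart) triple in kvindices. With io.sort.mb = 100 and io.sort.record.percent = 0.05, 5 MB is reserved for metadata and roughly 95 MB for kvbuffer, so at most 5 * 2^20 / 16 = 327,680 records can be held before a spill, no matter how small they are. Jobs with tiny records exhaust the metadata region first and spill with most of kvbuffer empty, which is exactly why io.sort.record.percent shows up on the later Con slide as obscure but critical to performance.
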
    • 66. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
    • 67. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 68. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 69. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      p0
      bufmark
      bufindex
    • 70. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      io.sort.spill.percent
      bufindex
      bufmark
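
      These thresholds are ordinary job configuration. A hedged example of setting them through the old mapred API (the values are purely illustrative, not recommendations):

        import org.apache.hadoop.mapred.JobConf;

        public class SortTuning {
          public static JobConf configure() {
            JobConf conf = new JobConf();
            // Total collection buffer in MB, split between serialized data and metadata.
            conf.setInt("io.sort.mb", 200);
            // Fraction of io.sort.mb reserved for per-record accounting rather than bytes.
            conf.setFloat("io.sort.record.percent", 0.10f);
            // Spill starts when either region passes this fraction of its capacity, so
            // collection can keep writing into the remaining space while the spill runs.
            conf.setFloat("io.sort.spill.percent", 0.80f);
            return conf;
          }
        }
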
    • 71. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      kvstart
      kvend
      kvindex
      bufend
      bufindex
      bufmark
    • 72. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
      bufend
    • 73. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufindex
      bufmark
      kvstart
      kvindex
      kvend
      bufend
    • 74. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufindex
      bufmark
      kvindex
      kvstart
      kvend
      bufstart
      bufend
    • 75. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      Invalid segments in the serialization buffer are marked by bufvoid
      RawComparator interface requires that the key be contiguous in the byte[]
      bufmark
      bufvoid
      bufindex
      kvindex
      kvstart
      kvend
      bufstart
      bufend
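
      The contiguity requirement comes from the raw comparison interface the sort uses: a comparator sees only (buffer, offset, length) per key, so a key wrapped around the end of the circular buffer could not be compared without copying; bufvoid marks the bytes abandoned to keep keys contiguous. An illustrative RawComparator over whole serialized keys (Hadoop ships tuned comparators for its own key types; this one simply uses lexicographic byte order):

        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.RawComparator;
        import org.apache.hadoop.io.WritableComparator;

        // Example only: orders keys by their serialized bytes without deserializing.
        // Each key must occupy one contiguous range of its buffer, hence bufvoid.
        class RawByteOrder implements RawComparator<BytesWritable> {
          @Override
          public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2);
          }
          @Override
          public int compare(BytesWritable a, BytesWritable b) {
            return a.compareTo(b);
          }
        }
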
    • 76. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufvoid
      bufmark
      bufindex
      kvindex
      kvstart
      kvend
      bufstart
      bufend
    • 77. Hadoop [0.17, 0.22)
      Pro:
      • Predictable memory footprint, collection (though not spill) agnostic to number of reducers. Most memory used for the sort allocated upfront and maintained for the full task duration.
      • No resizing of buffers, no copying of serialized record data or metadata
      • Uses SequenceFile::appendRaw to avoid a deserialization/serialization pass
      • Effects record compression in-place (removed in 0.18 with improvements to the intermediate data format, HADOOP-2095)
      Other Performance Improvements
      • Improved performance, no metadata copying using QuickSort (HADOOP-3308)
      • Caching of spill indices (HADOOP-3638)
      • Run combiner during the merge (HADOOP-3226)
      • Improved locking and synchronization (HADOOP-{5664,3617})
      Con:
      • Complexity and new code responsible for several bugs in 0.17 (HADOOP-{3442,3550,3475,3603})
      • io.sort.record.percent is obscure, critical to performance, and awkward
      • While predictable, memory usage is arguably too restricted
      • Really? io.sort.record.percent? (MAPREDUCE-64)
    • Hadoop [0.22]
      bufstart
      bufend
      bufindex
      bufmark
      equator
      kvstart
      kvend
      kvindex
    • 88. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
    • 89. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 90. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 91. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 92. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
      p0
      kvoffsets and kvindices information interlaced into metadata blocks. The sort is effected in a manner identical to 0.17, but metadata is allocated per-record, rather than a priori
      (kvoffsets)
      (kvindices)
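
      A sketch of the per-record metadata block implied above, under the assumption (following the MAPREDUCE-64 design) that each record contributes a fixed four ints on the metadata side of the equator while its serialized key/value bytes grow on the other side; the exact field order here is illustrative:

        // Illustrative constants only; field order is an assumption for illustration.
        final class MetaLayout {
          static final int PARTITION = 0;          // partition of the record
          static final int KEYSTART  = 1;          // offset of the serialized key
          static final int VALSTART  = 2;          // offset of the serialized value
          static final int VALLEN    = 3;          // length of the serialized value
          static final int NMETA     = 4;          // ints of metadata per record
          static final int METASIZE  = NMETA * 4;  // 16 bytes per record

          // Metadata blocks are laid down backward from the equator as records are
          // collected, while key/value bytes advance forward; both live in the same
          // circular byte[], so neither region has to be sized up front.
          private MetaLayout() { }
        }

      The sort still permutes only metadata, as in 0.17, but the moving equator replaces the fixed io.sort.record.percent split.
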
    • 93. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
    • 94. Hadoop [0.22]
      bufstart
      kvstart
      kvend
      bufend
      kvindex
      equator
      bufindex
      bufmark
    • 95. Hadoop [0.22]
      bufstart
      kvstart
      kvend
      bufend
      bufindex
      bufmark
      kvindex
      equator
    • 96. Hadoop [0.22]
      kvstart
      kvend
      bufstart
      bufend
      bufindex
      bufmark
      kvindex
      equator
    • 97. Hadoop [0.22]
      bufindex
      bufmark
      kvindex
      equator
      bufstart
      bufend
      kvstart
      kvend
    • 98. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
    • 99. Hadoop [0.22]
      bufstart
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
      bufend
      equator
    • 100. Hadoop [0.22]
      bufindex
      kvindex
      kvstart
      kvend
      bufmark
      bufstart
      bufend
      equator
    • 101. Questions?