Ordered Record Collection

Chris Douglas, Yahoo!

  • Speaker note: Every presenter must include a slide like this one, and protocol demands that it contain no fewer than 5 inaccuracies
  • Transcript

    • 1. Sort of Vinyl: Ordered Record Collection
      Chris Douglas
      01.18.2010
    • 2. Obligatory MapReduce Flow Slide
      Split 2
      Map 2
      Combine*
      Reduce 1
      Split 1
      Map 1
      hdfs://host:8020/input/data
      hdfs://host:8020/output/data
      HDFS
      HDFS
      Combine*
      Reduce 1
      Split 0
      Map 0
      Combine*
    • 3. Obligatory MapReduce Flow Slide
      Map Output Collection
      Split 2
      Map 2
      Combine*
      Reduce 1
      Split 1
      Map 1
      hdfs://host:8020/input/data
      hdfs://host:8020/output/data
      HDFS
      HDFS
      Combine*
      Reduce 1
      Split 0
      Map 0
      Combine*
    • 4. Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
    • 5. Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
      Cretaceous
      Jurassic
      Triassic
    • 6. Awesome!
    • 7. Problem Description
      map(K1,V1)
      *
      collect(K2,V2)
    • 8. Problem Description
      p0  partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
    • 9. Problem Description
      p0  partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
    • 10. Problem Description
      p0  partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
    • 11. Problem Description
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
    • 12. Problem Description
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
    • 13. Problem Description
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
      val0
    • 14. Problem Description
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
      val0
    • 15. Problem Description
      int
      p0 partition(key0,val0)
      map(K1,V1)
      *
      Serialization
      collect(K2,V2)
      *
      K2.write(DataOutput)
      write(byte[], int, int)
      *
      V2.write(DataOutput)
      write(byte[], int, int)
      key0
      val0
      byte[]
      byte[]
    • 16. Problem Description
      For all calls to collect(K2 keyn, V2 valn):
      • Store result of partition(K2 keyn, V2 valn)
      • 17. Ordered set of write(byte[], int, int) for keyn
      • 18. Ordered set of write(byte[], int, int) for valn
      Challenges:
      • Size of key/value unknown a priori
      • 19. Records must be grouped for efficient fetch from reduce
      • 20. Sort occurs after the records are serialized
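To make the challenges above concrete, here is a minimal, illustrative sketch (not Hadoop's actual collector; every name below is an assumption) of what collect(K2,V2) must accomplish: serialize the key and value through their Writable write(DataOutput) methods into a shared byte buffer, and remember per record the partition plus the offsets and lengths needed to sort and group the serialized bytes later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Writable;

// Illustrative only: one growing buffer of serialized records, plus a
// (partition, keyStart, keyLen, valLen) entry per record for later sorting.
class NaiveCollector<K extends Writable, V extends Writable> {
  static class RecordMeta {
    final int partition, keyStart, keyLen, valLen;
    RecordMeta(int partition, int keyStart, int keyLen, int valLen) {
      this.partition = partition; this.keyStart = keyStart;
      this.keyLen = keyLen; this.valLen = valLen;
    }
  }

  private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
  private final DataOutputStream out = new DataOutputStream(buf);
  private final List<RecordMeta> meta = new ArrayList<>();

  // The size of a key or value is unknown until write() returns, so lengths
  // are derived from the stream position before and after each write.
  void collect(K key, V val, int partition) throws IOException {
    final int keyStart = out.size();
    key.write(out);                            // K2.write(DataOutput)
    final int keyLen = out.size() - keyStart;
    val.write(out);                            // V2.write(DataOutput)
    final int valLen = out.size() - keyStart - keyLen;
    meta.add(new RecordMeta(partition, keyStart, keyLen, valLen));
  }
}
```

Sorting can then permute the small metadata entries while the serialized bytes stay put, which is the constraint all of the designs below work around.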
    • Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
      Cretaceous
      Jurassic
      Triassic
    • 21. Hadoop (∞, 0.10)
      p0 partition(key0,val0)
      map(K1,V1)
      *
      collect(K2,V2)
      collect(K2,V2)
      SequenceFile::Writer[p0].append(key0, val0)


    • 22. Hadoop (∞, 0.10)
      p0 partition(key0,val0)
      map(K1,V1)
      *
      collect(K2,V2)
      collect(K2,V2)
      key0.write(localFS)
      SequenceFile::Writer[p0].append(key0, val0)
      val0.write(localFS)


    • 23. Hadoop (∞, 0.10)
      p0 partition(key0,val0)
      map(K1,V1)
      *
      collect(K2,V2)
      collect(K2,V2)
      key0.write(localFS)
      SequenceFile::Writer[p0].append(key0, val0)
      val0.write(localFS)


    • 24. Hadoop (∞, 0.10)
      Not necessarily true. SeqFile may buffer a configurable amount of data to effect block compression, stream buffering, etc.
      p0 partition(key0,val0)
      map(K1,V1)
      *
      collect(K2,V2)
      collect(K2,V2)
      key0.write(localFS)
      SequenceFile::Writer[p0].append(key0, val0)
      val0.write(localFS)
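As a hedged sketch of the pre-0.10 scheme above (names and setup are illustrative, not the original MapTask code): collection keeps one SequenceFile.Writer per reduce partition on the local filesystem and appends each record to the writer chosen by the partitioner; nothing is sorted on the map side.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustrative: one SequenceFile writer per reduce partition on the local FS.
class PerPartitionAppend {
  private final SequenceFile.Writer[] writers;

  PerPartitionAppend(Configuration conf, FileSystem localFs, Path dir,
                     int numReduces) throws IOException {
    writers = new SequenceFile.Writer[numReduces];
    for (int p = 0; p < numReduces; ++p) {
      writers[p] = SequenceFile.createWriter(localFs, conf,
          new Path(dir, "part-" + p), Text.class, IntWritable.class);
    }
  }

  // collect(K2,V2): append straight to the partition's file; sorting and
  // merging happen later, after the reduce has fetched its inputs.
  void collect(Text key, IntWritable val, int partition) throws IOException {
    writers[partition].append(key, val);  // SequenceFile::Writer[p0].append(key0, val0)
  }
}
```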


    • 25. Hadoop (∞, 0.10)
      key0
      key1
      clone(key0, val0)
      map(K1,V1)
      key2
      *
      flush()
      collect(K2,V2)
      collect(K2,V2)
      reduce(keyn, val*)
      SequenceFile::Writer[p0].append(keyn’, valn’)

      p0 partition(key0,val0)

    • 26. Hadoop (∞, 0.10)
      key0
      key1
      clone(key0, val0)
      map(K1,V1)
      key2
      *
      flush()
      collect(K2,V2)
      collect(K2,V2)
      reduce(keyn, val*)
      SequenceFile::Writer[p0].append(keyn’, valn’)

      p0 partition(key0,val0)

    • 27. Hadoop (∞, 0.10)
      key0
      key1
      clone(key0, val0)
      map(K1,V1)
      key2
      *
      flush()
      collect(K2,V2)
      collect(K2,V2)
      reduce(keyn, val*)
      SequenceFile::Writer[p0].append(keyn’, valn’)

      p0 partition(key0,val0)

      Combiner may change the partition and ordering of input records. This is no longer supported
    • 28. Hadoop (∞, 0.10)
      Reduce k
      Reduce 0

      TaskTracker

    • 29. Hadoop (∞, 0.10)
      Reduce k
      Reduce 0

      TaskTracker

    • 30. Hadoop (∞, 0.10)
      Reduce 0
      sort/merge → localFS

    • 31. Hadoop (∞, 0.10)
      Pro:
      • Complexity of sort/merge encapsulated in SequenceFile, shared between MapTask and ReduceTask
      • 32. Very versatile Combiner semantics (change sort order, partition)
      Con:
      • Copy/sort can take a long time for each reduce (lost opportunity to parallelize sort)
      • 33. Job cleanup is expensive (e.g. 7k reducer job must delete 7k files per map on that TT)
      • 34. Combiner is expensive to use and its memory usage is difficult to track
      • 35. OOMExceptions from untracked memory in buffers, particularly when using compression (HADOOP-570)
    • Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
      Cretaceous
      Jurassic
      Triassic
    • 36. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 37. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 38. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 39. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 40. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
    • 41. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      Add up the memory used by all BufferSorter implementations and the keyValBuffer. If the spill threshold is exceeded, spill the contents to disk.
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
      Keep offset into buffer, length of key, value.
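A rough, illustrative version of the bookkeeping this slide describes (the real class was BufferSorter; everything named below is an assumption): each collected record adds a (recOff, keylen, vallen) entry to its partition's metadata, and the estimated footprint of serialized bytes plus metadata is checked against a spill threshold on every call to collect.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative memory accounting: sum the serialized-record bytes plus the
// per-partition metadata, and report when a configured threshold is crossed.
class SpillAccounting {
  static final int META_BYTES_PER_RECORD = 3 * 4;  // recOff, keylen, vallen as ints

  private final List<int[]>[] partitionMeta;       // one metadata list per partition
  private final long spillThresholdBytes;
  private long serializedBytes = 0;
  private long totalRecords = 0;

  @SuppressWarnings("unchecked")
  SpillAccounting(int numPartitions, long spillThresholdBytes) {
    this.partitionMeta = new List[numPartitions];
    for (int i = 0; i < numPartitions; ++i) {
      partitionMeta[i] = new ArrayList<>();
    }
    this.spillThresholdBytes = spillThresholdBytes;
  }

  // Called from collect() after the key/value have been serialized.
  // Returns true when the caller should run sortAndSpillToDisk().
  boolean addKeyValue(int partition, int recOff, int keyLen, int valLen) {
    partitionMeta[partition].add(new int[] { recOff, keyLen, valLen });
    serializedBytes += keyLen + valLen;
    totalRecords++;
    long used = serializedBytes + META_BYTES_PER_RECORD * totalRecords;
    return used >= spillThresholdBytes;
  }
}
```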
    • 42. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      *
      0
      1
      k-1
      k
      sortAndSpillToDisk()
      *Sort permutes offsets into (offset,keylen,vallen). Once ordered, each record is output into a SeqFile and the partition offsets recorded
      0
    • 43. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      *
      0
      1
      k-1
      k
      sortAndSpillToDisk()
      *Sort permutes offsets into (offset,keylen,vallen). Once ordered, each record is output into a SeqFile and the partition offsets recorded
      0
      K2.readFields(DataInput)
      V2.readFields(DataInput)
      SequenceFile::append(K2,V2)
    • 44. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
      *If defined, the combiner is now run during the spill, separately over each partition. Values emitted from the combiner are written directly to the output partition.
      0
      K2.readFields(DataInput)
      V2.readFields(DataInput)
      *
      << Combiner >>
      SequenceFile::append(K2,V2)
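Sketching the spill path of slides 42-44 under assumed names: the metadata entries of each partition are ordered by comparing the raw key bytes they reference in the shared buffer (Hadoop's RawComparator interface does exactly this kind of comparison), and the sorted order is then replayed, through the combiner when one is defined, into the spill file.

```java
import java.util.List;

import org.apache.hadoop.io.RawComparator;

// Illustrative: order metadata entries {recOff, keyLen, valLen} by the raw
// key bytes they point at in the shared serialization buffer. The sorted
// order is then walked to read records back (or feed them to the combiner)
// and append them to the spill file in key order.
class SortAndSpillSketch {
  static void sortPartition(List<int[]> meta, final byte[] keyValBuffer,
                            final RawComparator<?> cmp) {
    meta.sort((a, b) -> cmp.compare(
        keyValBuffer, a[0], a[1],     // key bytes of record a
        keyValBuffer, b[0], b[1]));   // key bytes of record b
  }
}
```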
    • 45. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      *
      0
      1
      k-1
      k
      sortAndSpillToDisk()
      0
      1
    • 46. Hadoop [0.10, 0.17)
      map(K1,V1)
      p0 partition(key0,val0)
      *
      collect(K2,V2)
      K2.write(DataOutput)
      V2.write(DataOutput)
      BufferSorter[p0].addKeyValue(recOff, keylen, vallen)

      0
      1
      k-1
      k
      sortAndSpillToDisk()
      0
      1


      k
    • 47. Hadoop [0.10, 0.17)
      mergeParts()
      0
      0
      0
      1
      1
      1






      k
      k
      k
    • 48. Hadoop [0.10, 0.17)
      mergeParts()
      0
      0
      0
      0
      1
      1
      1






      k
      k
      k
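mergeParts() then produces a single map output file: for every partition, the segments from all spills are merged into one sorted run, and each partition's start offset is recorded so a reduce can fetch only its slice. Below is an illustrative in-memory k-way merge of already-sorted segments; the real code streams segments from the spill files on disk.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Illustrative k-way merge of sorted per-spill segments for one partition.
class KWayMerge {
  static <T> void merge(List<Iterator<T>> segments, Comparator<T> cmp,
                        Consumer<T> output) {
    // Heap entry: the current head of a segment plus the segment it came from.
    class Head {
      final T value; final Iterator<T> src;
      Head(T value, Iterator<T> src) { this.value = value; this.src = src; }
    }
    PriorityQueue<Head> heap =
        new PriorityQueue<>((x, y) -> cmp.compare(x.value, y.value));
    for (Iterator<T> s : segments) {
      if (s.hasNext()) heap.add(new Head(s.next(), s));
    }
    while (!heap.isEmpty()) {
      Head h = heap.poll();
      output.accept(h.value);                       // append to the merged run
      if (h.src.hasNext()) heap.add(new Head(h.src.next(), h.src));
    }
  }
}
```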
    • 49. Hadoop [0.10, 0.17)
      Reduce 0
      0
      1

      TaskTracker


      k
      Reduce k
    • 50. Hadoop [0.10, 0.17)
      Reduce 0
      0
      1

      TaskTracker


      k
      Reduce k
    • 51. Hadoop [0.10, 0.17)
      Pro:
      • Distributes the sort/merge across all maps; reducer need only merge its inputs
      • 52. Much more predictable memory footprint
      • 53. Shared, in-memory buffer across all partitions w/ efficient sort
      • 54. Combines over each spill, defined by memory usage, instead of record count
      • 55. Running the combiner doesn’t require storing a clone of each record (fewer serializations)
      • 56. In 0.16, spill was made concurrent with collection (HADOOP-1965)
      Con:
      • Expanding buffers may impose a performance penalty; used memory calculated on every call to collect(K2,V2)
      • 57. MergeSort copies indices on each level of recursion
      • 58. Deserializing the key/value before appending to the SequenceFile is avoidable
      • 59. Combiner weakened by requiring sort order and partition to remain consistent
      • 60. Though tracked, BufferSorter instances take non-negligible space (HADOOP-1698)
    • Overview
      Hadoop (∞, 0.10)
      Hadoop [ 0.10, 0.17)
      Hadoop [0.17, 0.22]
      Lucene
      HADOOP-331
      HADOOP-2919
      Cretaceous
      Jurassic
      Triassic
    • 61. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
    • 62. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      io.sort.mb * io.sort.record.percent

      io.sort.mb
    • 63. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      Instead of explicitly tracking space used by record metadata, allocate a configurable amount of space at the beginning of the task
      io.sort.mb * io.sort.record.percent

      io.sort.mb
    • 64. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      bufindex
      bufmark
      io.sort.mb * io.sort.record.percent
      kvstart
      kvend
      kvindex
      io.sort.mb
    • 65. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      bufindex
      bufmark
      io.sort.mb * io.sort.record.percent
      kvstart
      kvend
      kvindex
      io.sort.mb
      kvoffsets
      kvindices
      Partition no longer implicitly tracked. Store (partition, keystart,valstart) for every record collected
      kvbuffer
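An illustrative reconstruction of the up-front allocation on slides 62-65, using the era's property names (io.sort.mb, io.sort.record.percent) but simplified accounting: the serialization buffer and the fixed-capacity metadata arrays are all sized once, when the task starts, and never grow.

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative layout of the 0.17-era collection buffer: all memory is
// allocated at task start and split between record bytes and record metadata.
class SortBufferLayout {
  static final int ACCT_SIZE = 3;   // (partition, keystart, valstart) per record

  final byte[] kvbuffer;            // serialized key/value bytes
  final int[] kvoffsets;            // the sort permutes these indices
  final int[] kvindices;            // ACCT_SIZE ints of metadata per record

  SortBufferLayout(Configuration conf) {
    final int sortMb = conf.getInt("io.sort.mb", 100);
    final float recPercent = conf.getFloat("io.sort.record.percent", 0.05f);
    final int maxMemUsage = sortMb << 20;                 // io.sort.mb in bytes
    final int recordCapacity = (int) (maxMemUsage * recPercent);
    // Each record costs one kvoffsets int plus ACCT_SIZE kvindices ints.
    final int maxRecords = recordCapacity / ((ACCT_SIZE + 1) * 4);

    kvbuffer  = new byte[maxMemUsage - recordCapacity];
    kvoffsets = new int[maxRecords];
    kvindices = new int[maxRecords * ACCT_SIZE];
  }
}
```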
    • 66. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
    • 67. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 68. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 69. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      p0
      bufmark
      bufindex
    • 70. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufend
      kvstart
      kvend
      kvindex
      io.sort.spill.percent
      bufindex
      bufmark
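The soft limit shown above (io.sort.spill.percent), sketched with assumed names: collection keeps running while a background thread spills, and the spill is started once either the serialized-data region or the metadata region crosses the configured fraction of its capacity.

```java
// Illustrative soft-limit check, not the MapTask implementation: start the
// spill thread when either region of the collection buffer is "full enough",
// so collection can continue into the remaining space while the spill runs.
class SpillTrigger {
  private final int softBufferLimit;   // bytes of the serialization buffer
  private final int softRecordLimit;   // entries of the metadata arrays

  SpillTrigger(int bufferCapacity, int recordCapacity, float spillPercent) {
    this.softBufferLimit = (int) (bufferCapacity * spillPercent);  // io.sort.spill.percent
    this.softRecordLimit = (int) (recordCapacity * spillPercent);
  }

  boolean shouldStartSpill(int bytesUsed, int recordsUsed) {
    return bytesUsed >= softBufferLimit || recordsUsed >= softRecordLimit;
  }
}
```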
    • 71. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      kvstart
      kvend
      kvindex
      bufend
      bufindex
      bufmark
    • 72. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
      bufend
    • 73. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufstart
      bufindex
      bufmark
      kvstart
      kvindex
      kvend
      bufend
    • 74. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufindex
      bufmark
      kvindex
      kvstart
      kvend
      bufstart
      bufend
    • 75. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      Invalid segments in the serialization buffer are marked by bufvoid
      RawComparator interface requires that the key be contiguous in the byte[]
      bufmark
      bufvoid
      bufindex
      kvindex
      kvstart
      kvend
      bufstart
      bufend
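A sketch of the bufvoid case, with assumed names and the bounds checks elided: when a key's serialization wraps past the end of the circular buffer, the prefix written before the wrap is copied so the whole key sits contiguously at the front (as the RawComparator contract requires), and bufvoid is pulled back so the abandoned bytes at the tail are ignored by the sort.

```java
// Illustrative only: make a key that wrapped the circular buffer contiguous.
// Assumes the front of the buffer has room; the real collector would
// otherwise block collection and force a spill first.
class KeyWrapSketch {
  byte[] kvbuffer;
  int bufindex;   // next write position
  int bufvoid;    // end of valid data; normally kvbuffer.length

  void makeKeyContiguous(int keystart) {
    final int headLen = bufvoid - keystart;  // key bytes written before the wrap
    final int tailLen = bufindex;            // key bytes written after wrapping to 0
    final byte[] head = new byte[headLen];
    System.arraycopy(kvbuffer, keystart, head, 0, headLen);

    // Slide the wrapped tail over, then restore the head in front of it.
    System.arraycopy(kvbuffer, 0, kvbuffer, headLen, tailLen);
    System.arraycopy(head, 0, kvbuffer, 0, headLen);

    bufvoid = keystart;                      // bytes past keystart are no longer valid
    bufindex = headLen + tailLen;            // the key is now contiguous at offset 0
  }
}
```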
    • 76. Hadoop [0.17, 0.22)
      map(K1,V1)
      p0  partition(key0,val0)
      *
      Serialization
      KS.serialize(K2)
      collect(K2,V2)
      VS.serialize(V2)
      bufvoid
      bufmark
      bufindex
      kvindex
      kvstart
      kvend
      bufstart
      bufend
    • 77. Hadoop [0.17, 0.22)
      Pro:
      • Predictable memory footprint, collection (though not spill) agnostic to number of reducers. Most memory used for the sort allocated upfront and maintained for the full task duration.
      • 78. No resizing of buffers, copying of serialized record data or metadata
      • 79. Uses SequenceFile::appendRaw to avoid deserialization/serialization pass
      • 80. Effects record compression in-place (removed in 0.18 with improvements to the intermediate data format, HADOOP-2095)
      Other Performance Improvements
      • Improved performance, no metadata copying using QuickSort (HADOOP-3308)
      • 81. Caching of spill indices (HADOOP-3638)
      • 82. Run combiner during the merge (HADOOP-3226)
      • 83. Improved locking and synchronization (HADOOP-{5664,3617})
      Con:
      • Complexity and new code responsible for several bugs in 0.17
      • 84. (HADOOP-{3442,3550,3475,3603})
      • 85. io.sort.record.percent is obscure, critical to performance, and awkward
      • 86. While predictable, memory usage is arguably too restricted
      • 87. Really? io.sort.record.percent? (MAPREDUCE-64)
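For reference, the three properties this section keeps coming back to, set through the old JobConf API; the values shown are the commonly cited defaults of that era, not recommendations:

```java
import org.apache.hadoop.mapred.JobConf;

// Illustrative tuning of the 0.17-era collection buffer.
public class SortBufferConfig {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setInt("io.sort.mb", 100);                  // total collection buffer, in MB
    conf.setFloat("io.sort.record.percent", 0.05f);  // fraction reserved for record metadata
    conf.setFloat("io.sort.spill.percent", 0.80f);   // soft limit that starts the spill thread
    System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
  }
}
```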
    • Hadoop [0.22]
      bufstart
      bufend
      bufindex
      bufmark
      equator
      kvstart
      kvend
      kvindex
    • 88. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
    • 89. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 90. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 91. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
    • 92. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufmark
      bufindex
      p0
      kvoffsets and kvindices information interlaced into metadata blocks. The sort is effected in a manner identical to 0.17, but metadata is allocated per-record, rather than a priori
      (kvoffsets)
      (kvindices)
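An illustrative sketch of the layout this slide describes (names and the exact metadata fields are assumptions): serialized records grow forward from the equator while fixed-size metadata entries, interlacing what kvoffsets and kvindices held separately in 0.17, grow backward from it in the same byte[], viewed through an IntBuffer. Because metadata is written per record, no fixed split between data and metadata has to be chosen up front.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Illustrative only: one byte[] holds serialized records (growing forward
// from the equator) and per-record metadata (growing backward from it).
// Assumes the capacity is a multiple of METASIZE so entries never straddle
// the wrap point.
class EquatorBufferSketch {
  static final int METASIZE = 4 * 4;  // e.g. partition, keystart, valstart, vallen

  final byte[] kvbuffer;
  final IntBuffer kvmeta;             // int view over the same bytes
  int equator;                        // boundary between data and metadata regions
  int bufindex;                       // next byte for serialized data (moves forward)
  int kvindex;                        // next metadata slot, in ints (moves backward)

  EquatorBufferSketch(int capacityBytes) {
    kvbuffer = new byte[capacityBytes];
    kvmeta = ByteBuffer.wrap(kvbuffer).asIntBuffer();
    setEquator(0);
  }

  void setEquator(int pos) {
    equator = pos;
    bufindex = pos;
    // The first metadata entry sits just "behind" the equator, wrapping if needed.
    kvindex = (pos / 4) - METASIZE / 4;
    if (kvindex < 0) kvindex += kvbuffer.length / 4;
  }

  // Record the metadata for a record whose key/value were just serialized.
  void writeMeta(int partition, int keystart, int valstart, int vallen) {
    kvmeta.put(kvindex + 0, partition);
    kvmeta.put(kvindex + 1, keystart);
    kvmeta.put(kvindex + 2, valstart);
    kvmeta.put(kvindex + 3, vallen);
    kvindex -= METASIZE / 4;          // the next entry grows away from the equator
    if (kvindex < 0) kvindex += kvbuffer.length / 4;
  }
}
```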
    • 93. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
    • 94. Hadoop [0.22]
      bufstart
      kvstart
      kvend
      bufend
      kvindex
      equator
      bufindex
      bufmark
    • 95. Hadoop [0.22]
      bufstart
      kvstart
      kvend
      bufend
      bufindex
      bufmark
      kvindex
      equator
    • 96. Hadoop [0.22]
      kvstart
      kvend
      bufstart
      bufend
      bufindex
      bufmark
      kvindex
      equator
    • 97. Hadoop [0.22]
      bufindex
      bufmark
      kvindex
      equator
      bufstart
      bufend
      kvstart
      kvend
    • 98. Hadoop [0.22]
      bufstart
      bufend
      equator
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
    • 99. Hadoop [0.22]
      bufstart
      kvstart
      kvend
      kvindex
      bufindex
      bufmark
      bufend
      equator
    • 100. Hadoop [0.22]
      bufindex
      kvindex
      kvstart
      kvend
      bufmark
      bufstart
      bufend
      equator
    • 101. Questions?