Sort of Vinyl: Ordered Record Collection
Chris Douglas
01.18.2010
Obligatory MapReduce Flow Slide
[Diagram: input on HDFS (hdfs://host:8020/input/data) is divided into Split 0, Split 1, Split 2, feeding Map 0, Map 1, Map 2; each map's output passes through an optional Combine* step and is fetched by the reduce tasks, whose output is written back to HDFS (hdfs://host:8020/output/data). A second build highlights the step this talk covers: map output collection.]
Overview
• Hadoop (∞, 0.10) — Lucene
• Hadoop [0.10, 0.17) — HADOOP-331
• Hadoop [0.17, 0.22] — HADOOP-2919
(the three eras are labeled Cretaceous, Jurassic, and Triassic)
Awesome!
Problem Description
[Diagram: map(K1,V1) emits zero or more records via collect(K2,V2). For each record, partition(key0, val0) produces an int, p0. On the serialization side, K2.write(DataOutput) turns into zero or more write(byte[], int, int) calls carrying key0, and V2.write(DataOutput) into zero or more write(byte[], int, int) calls carrying val0. The framework ends up holding an int (the partition) and two byte[] streams (the serialized key and value).]
Problem Description
For all calls to collect(K2 keyn, V2 valn), store:
• The result of partition(K2 keyn, V2 valn)
• The ordered set of write(byte[], int, int) calls for keyn
• The ordered set of write(byte[], int, int) calls for valn
Challenges:
• The size of each key/value is unknown a priori
• Records must be grouped by partition for efficient fetch from the reduce
• The sort occurs after the records are serialized
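In code, the contract looks roughly like the sketch below (SimpleCollector and BufferedRecord are illustrative names, not Hadoop classes; a hash partitioner stands in for the job's Partitioner): for every collected record, keep the partition number and the raw bytes the key and value serialize to.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Writable;

    // Illustrative only: what any map-side collector must retain per record.
    class SimpleCollector<K extends Writable, V extends Writable> {
      static class BufferedRecord {
        final int partition;   // result of partition(keyn, valn)
        final byte[] key;      // bytes emitted by K2.write(DataOutput)
        final byte[] value;    // bytes emitted by V2.write(DataOutput)
        BufferedRecord(int p, byte[] k, byte[] v) { partition = p; key = k; value = v; }
      }

      private final List<BufferedRecord> records = new ArrayList<BufferedRecord>();
      private final int numPartitions;

      SimpleCollector(int numPartitions) { this.numPartitions = numPartitions; }

      void collect(K key, V value) throws IOException {
        int p = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        records.add(new BufferedRecord(p, toBytes(key), toBytes(value)));
      }

      // Size unknown a priori: the stream grows as write(byte[],int,int) calls arrive
      private static byte[] toBytes(Writable w) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        w.write(new DataOutputStream(bos));
        return bos.toByteArray();
      }
    }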
Hadoop (∞, 0.10)
[Diagram: map(K1,V1) calls collect(K2,V2); p0 = partition(key0, val0) selects a per-partition writer, and SequenceFile::Writer[p0].append(key0, val0) writes key0 and val0 to localFS.]
Note: the write to localFS is not necessarily immediate. SeqFile may buffer a configurable amount of data to effect block compression, stream buffering, etc.
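A minimal sketch of this scheme, assuming one SequenceFile.Writer per partition on the local filesystem (DirectAppendCollector, outDir, and the Text key/value types are stand-ins): there is no map-side sort at all.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Illustrative stand-in for the pre-0.10 collector: each record is appended
    // straight to its partition's SequenceFile on the local filesystem.
    class DirectAppendCollector {
      private final SequenceFile.Writer[] writers;
      private final int numPartitions;

      DirectAppendCollector(Configuration conf, Path outDir, int numPartitions)
          throws IOException {
        this.numPartitions = numPartitions;
        this.writers = new SequenceFile.Writer[numPartitions];
        FileSystem localFs = FileSystem.getLocal(conf);   // writes go to localFS
        for (int p = 0; p < numPartitions; ++p) {
          writers[p] = SequenceFile.createWriter(localFs, conf,
              new Path(outDir, "part-" + p), Text.class, Text.class);
        }
      }

      void collect(Text key, Text val) throws IOException {
        int p = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        writers[p].append(key, val);   // may buffer for block compression, etc.
      }
    }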
Hadoop (∞, 0.10), with a combiner
[Diagram: collect(K2,V2) instead clone(key0, val0)s each record into an in-memory buffer (key0, key1, key2, …). On flush(), the combiner runs as reduce(keyn, val*) over the buffered records, and its output goes out via SequenceFile::Writer[p0].append(keyn’, valn’).]
Note: in this era the combiner may change the partition and ordering of the input records. This is no longer supported.
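A sketch of that combiner path, assuming records are cloned into a sorted in-memory map and flushed by record count (CombinerBuffer and maxRecords are illustrative; the Reducer/OutputCollector/Reporter types are the real old mapred API):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableUtils;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative only: buffer cloned records, then run the combiner as a reduce.
    class CombinerBuffer {
      private final TreeMap<Text, List<Text>> buffer = new TreeMap<Text, List<Text>>();
      private final Reducer<Text, Text, Text, Text> combiner;
      private final OutputCollector<Text, Text> out;   // appends to SeqFile[p0]
      private final Configuration conf;
      private final int maxRecords;
      private int numRecords;

      CombinerBuffer(Reducer<Text, Text, Text, Text> combiner,
                     OutputCollector<Text, Text> out, Configuration conf,
                     int maxRecords) {
        this.combiner = combiner; this.out = out; this.conf = conf;
        this.maxRecords = maxRecords;
      }

      void collect(Text key, Text val) throws IOException {
        // clone(key0, val0): the map reuses its objects, so copies must be stored
        Text k = WritableUtils.clone(key, conf);
        Text v = WritableUtils.clone(val, conf);
        List<Text> vals = buffer.get(k);
        if (vals == null) { vals = new ArrayList<Text>(); buffer.put(k, vals); }
        vals.add(v);
        if (++numRecords >= maxRecords) flush();   // flushed by record count
      }

      void flush() throws IOException {
        for (Map.Entry<Text, List<Text>> e : buffer.entrySet()) {
          // reduce(keyn, val*); pre-0.10, its output may even change the partition
          combiner.reduce(e.getKey(), e.getValue().iterator(), out, Reporter.NULL);
        }
        buffer.clear();
        numRecords = 0;
      }
    }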
Hadoop (∞, 0.10)
[Diagram: Reduce 0 … Reduce k each fetch their partition's files from every map's TaskTracker; each reduce then runs its own sort/merge on localFS.]
Hadoop (∞, 0.10)
Pro:
• Complexity of sort/merge encapsulated in SequenceFile, shared between MapTask and ReduceTask
• Very versatile Combiner semantics (may change sort order, partition)
Con:
• Copy/sort can take a long time for each reduce (a lost opportunity to parallelize the sort)
• Job cleanup is expensive (e.g. a 7k-reducer job must delete 7k files per map on that TaskTracker)
• Combiner is expensive to use and its memory usage is difficult to track
• OOM exceptions from untracked memory in buffers, particularly when using compression (HADOOP-570)
Hadoop [0.10, 0.17)
[Diagram: map(K1,V1) calls collect(K2,V2); K2.write(DataOutput) and V2.write(DataOutput) serialize into a shared keyValBuffer, while BufferSorter[p0].addKeyValue(recOff, keylen, vallen) keeps the offset into the buffer and the lengths of the key and value.]
• The memory used by all BufferSorter implementations and the keyValBuffer is summed; if the spill threshold is exceeded, the contents are spilled to disk by sortAndSpillToDisk().
• The sort permutes the (offset, keylen, vallen) triples. Once ordered, each record is deserialized (K2.readFields(DataInput), V2.readFields(DataInput)) and appended to a SequenceFile (SequenceFile::append(K2,V2)), and the start offset of each partition 0…k is recorded.
• If defined, the combiner is now run during the spill, separately over each partition. Values emitted from the combiner are written directly to the output partition.
[Diagram: each spill produces a file on disk holding sorted partitions 0…k; repeated spills accumulate.]
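A sketch of that sort, assuming raw key bytes are compared directly in the shared buffer (the field names mirror the slide; SpillSorter and the rest are illustrative): only the metadata moves, never the serialized records.

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.hadoop.io.WritableComparator;

    // Illustrative: sort one partition by permuting metadata, not record data.
    class SpillSorter {
      final byte[] keyValBuffer;            // shared serialization buffer
      final int[] recOff, keyLen, valLen;   // per-record metadata for one partition

      SpillSorter(byte[] buf, int[] recOff, int[] keyLen, int[] valLen) {
        this.keyValBuffer = buf; this.recOff = recOff;
        this.keyLen = keyLen; this.valLen = valLen;
      }

      // Returns the record indices in key order; record bytes never move.
      Integer[] sortedOrder(int count) {
        Integer[] order = new Integer[count];
        for (int i = 0; i < count; ++i) order[i] = i;
        Arrays.sort(order, new Comparator<Integer>() {
          public int compare(Integer a, Integer b) {
            return WritableComparator.compareBytes(
                keyValBuffer, recOff[a], keyLen[a],
                keyValBuffer, recOff[b], keyLen[b]);
          }
        });
        // Each record is then deserialized and appended to the spill's
        // SequenceFile in this order, recording each partition's start offset.
        return order;
      }
    }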
Hadoop [0.10, 0.17)
[Diagram: mergeParts() merges the corresponding partition segments of every spill file into a single output file, partitions 0…k in order.]
Hadoop [0.10, 0.17)
[Diagram: Reduce 0 … Reduce k now each fetch a single contiguous segment, located via the index of partitions 0…k, from the map's TaskTracker.]
Hadoop [0.10, 0.17)
Pro:
• Distributes the sort/merge across all maps; a reducer need only merge its inputs
• Much more predictable memory footprint
• Shared, in-memory buffer across all partitions with an efficient sort
• Combines over each spill, defined by memory usage, instead of by record count
• Running the combiner doesn't require storing a clone of each record (fewer serializations)
• In 0.16, the spill was made concurrent with collection (HADOOP-1965)
Con:
• Expanding buffers may impose a performance penalty; used memory is recalculated on every call to collect(K2,V2)
• MergeSort copies indices on each level of recursion
• Deserializing the key/value before appending to the SequenceFile is avoidable
• Combiner semantics weakened by requiring the sort order and partition to remain consistent
• Though tracked, BufferSorter instances take non-negligible space (HADOOP-1698)
Hadoop [0.17, 0.22)
[Diagram: map(K1,V1) calls collect(K2,V2); serialization now goes through pluggable serializers, KS.serialize(K2) and VS.serialize(V2).]
• Instead of explicitly tracking the space used by record metadata, a configurable amount of space is allocated at the beginning of the task: io.sort.mb bytes in total, of which io.sort.mb * io.sort.record.percent is reserved for accounting.
• The partition is no longer implicitly tracked. kvindices stores (partition, keystart, valstart) for every record collected; kvoffsets holds one offset per record for the sort to permute; the serialized bytes live in kvbuffer, tracked by the pointers bufstart, bufend, bufindex, and bufmark (with kvstart, kvend, and kvindex tracking the accounting buffers).
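A sketch of that upfront allocation, modeled on the MapOutputBuffer of this era (exact rounding may differ): 16 bytes of accounting per record, one int in kvoffsets plus a (partition, keystart, valstart) triple in kvindices, with the rest of io.sort.mb going to kvbuffer.

    import org.apache.hadoop.mapred.JobConf;

    // Modeled on the 0.17-era MapOutputBuffer; CollectionBuffers is illustrative.
    class CollectionBuffers {
      static final int ACCTSIZE = 3;                  // kvindices ints per record
      static final int RECSIZE = (ACCTSIZE + 1) * 4;  // metadata bytes per record

      byte[] kvbuffer;    // serialized records
      int[] kvoffsets;    // the array the sort permutes
      int[] kvindices;    // (partition, keystart, valstart) per record
      int bufvoid;        // end of valid serialized data
      int softBufferLimit, softRecordLimit;

      void allocate(JobConf job) {
        final int sortmb = job.getInt("io.sort.mb", 100);
        final float recper = job.getFloat("io.sort.record.percent", 0.05f);
        final float spillper = job.getFloat("io.sort.spill.percent", 0.8f);
        int maxMemUsage = sortmb << 20;                     // whole collection buffer
        int recordCapacity = (int) (maxMemUsage * recper);  // slice for metadata
        recordCapacity -= recordCapacity % RECSIZE;         // whole records only
        kvbuffer = new byte[maxMemUsage - recordCapacity];
        bufvoid = kvbuffer.length;
        recordCapacity /= RECSIZE;
        kvoffsets = new int[recordCapacity];
        kvindices = new int[recordCapacity * ACCTSIZE];
        // spilling starts once either buffer passes its soft limit
        softBufferLimit = (int) (kvbuffer.length * spillper);
        softRecordLimit = (int) (kvoffsets.length * spillper);
      }
    }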
Hadoop [0.17, 0.22)
[Animation: as collect(K2,V2) runs, serialization advances bufindex through kvbuffer, bufmark marks the end of each completed record, and kvindex advances through the accounting buffers as each record's (p0, keystart, valstart) is recorded. When io.sort.spill.percent of either buffer fills, the range between (kvstart, bufstart) and (kvend, bufend) is handed to the spill thread while collection continues into the remaining space; both buffers are circular, so the indices wrap around, and the start pointers advance as spills retire.]
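Continuing the CollectionBuffers sketch, the trigger might be checked as below (shouldSpill is a hypothetical helper; in the real MapOutputBuffer the check lives inline in collect). Because both buffers are circular, "used" must account for indices that have wrapped past the end of their arrays.

    // Continuing the sketch: the slide's pointers as fields of CollectionBuffers.
    int bufstart, bufend, bufindex, bufmark;   // positions in kvbuffer
    int kvstart, kvend, kvindex;               // positions in kvoffsets

    // Hypothetical helper: spill once either circular buffer passes its soft limit.
    boolean shouldSpill() {
      final int bufUsed = bufindex >= bufstart
          ? bufindex - bufstart
          : (bufvoid - bufstart) + bufindex;            // serialized data wrapped
      final int recUsed = kvindex >= kvstart
          ? kvindex - kvstart
          : (kvoffsets.length - kvstart) + kvindex;     // metadata wrapped
      return bufUsed > softBufferLimit || recUsed > softRecordLimit;
    }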
Hadoop [0.17, 0.22)
• The RawComparator interface requires that the key be contiguous in the byte[], so a key that wraps around the end of the serialization buffer must be re-copied to the front before it can be compared.
• Invalid segments in the serialization buffer are marked by bufvoid.
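A sketch of that fix, modeled on BlockingBuffer.reset() in MapTask and showing only the happy path (it assumes there is room at the front of the buffer; otherwise a spill must intervene):

    // Continuing the sketch: the last key wrapped past the end of kvbuffer,
    // so mark the tail invalid with bufvoid and rebuild the key contiguously at 0.
    void resetWrappedKey() {
      final int headbytelen = bufvoid - bufmark;  // leading piece of the key,
                                                  // stranded at the buffer's end
      bufvoid = bufmark;                          // bytes beyond bufmark now void
      // shift the key's trailing piece [0, bufindex) right to make room, then
      // copy the leading piece down from the end of the buffer in front of it
      System.arraycopy(kvbuffer, 0, kvbuffer, headbytelen, bufindex);
      System.arraycopy(kvbuffer, bufvoid, kvbuffer, 0, headbytelen);
      bufindex += headbytelen;                    // key is contiguous at offset 0
    }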
Hadoop [0.17, 0.22)
Pro:
• Predictable memory footprint; collection (though not the spill) is agnostic to the number of reducers. Most of the memory used for the sort is allocated upfront and maintained for the full task duration.
• No resizing of buffers, no copying of serialized record data or metadata


Editor's Notes

  • #3 Every presenter must include a slide like this one, and protocol demands that it contain no fewer than 5 inaccuracies