Join Optimization in Hive
Liyin Tang

Outline
  • Map Join Optimization
    • Previous Common Join and Map Join
    • Optimized Map Join
    • JDBM
    • Performance Evaluation
  • Convert Join to Map Join Automatically
    • How it works
    • Performance Evaluation
Common Join
[Diagram: Task A produces Table X and Table Y; the Common Join Task reads them with many Mappers, followed by Shuffle and Reducer.]

Previous Map Join
[Diagram: Task A; the MapJoin Task runs many Mappers, each reading the Small Table Data directly and joining it against Records of the Big Table Data; Task C follows.]

Optimized Map Join
[Diagram: Task A; a MapReduce Local Task reads the Small Table Data and writes HashTable Files; the files are uploaded to the Distributed Cache (DC); the MapJoin Task's Mappers load the HashTable Files and join them against Records of the Big Table Data; Task C follows.]
JDBM
  • JDBM is too heavyweight for Map Join
    • Takes more than 70% of CPU time
    • Generates very large files
  • No need to use a persistent hashtable for map join
Performance Evaluation I
Converting Common Join into Map Join
[Diagram: Previous Execution Flow: Task A -> CommonJoinTask -> Task C. Optimized Execution Flow: Task A -> Conditional Task, which selects one of several branches (MapJoinLocalTask -> MapJoinTask for each candidate big table, or CommonJoinTask) -> Task C.]

Compile Time
SELECT * FROM SRC1 x JOIN SRC2 y ON x.key = y.key;
[Diagram: Task A -> Conditional Task with three branches: (assume table x is the big table) MapJoinLocalTask -> MapJoinTask; (assume table y is the big table) MapJoinLocalTask -> MapJoinTask; CommonJoinTask. All branches lead to Task C.]

Execution Time
SELECT * FROM SRC1 x JOIN SRC2 y ON x.key = y.key;
[Diagram: Task A -> Conditional Task. If table X is the big table: MapJoinLocalTask -> MapJoinTask. If both tables are too big for map join: CommonJoinTask. Then Task C.]

Backup Task
[Diagram: Task A -> Conditional Task -> MapJoinLocalTask (memory bound) -> MapJoinTask -> Task C. The CommonJoinTask runs as a Backup Task if the local task fails.]
Performance Bottleneck
  • Distributed Cache is the potential performance bottleneck
    • Large hashtable files slow down the propagation of the Distributed Cache
    • Mappers wait for the hashtable files from the Distributed Cache
  • Compress and archive all the hashtable files into a tar file
Compress and Archive
[Diagram: Task A; a MapReduce Local Task reads the Small Table Data and writes HashTable Files, which are compressed and archived before being placed in the Distributed Cache; the MapJoin Task's Mappers join them against Records of the Big Table Data; Task C follows.]
Performance Evaluation II
Performance Evaluation III
Future Work
  • Audit how many joins are converted into map joins in the cluster
  • Set the hashtable file replica number based on the number of Mappers
  • Tune the limit on small table data size by sampling
  • Increase the in-memory hashtable capacity

Thank you
Liyin Tang

Editor's Notes

  • #4 A common join in Hive involves a map stage and a reduce stage. The shuffle stage before the reducer is expensive because the intermediate files have to be sorted and merged, so we try to avoid this stage whenever possible.
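The map, shuffle, and reduce stages described in this note can be sketched in a few lines of Python. This is only an illustration of a reduce-side join on in-memory lists, not Hive's implementation; the function name `common_join` and the `(key, value)` row layout are assumptions made for the example.

```python
from collections import defaultdict


def common_join(table_x, table_y):
    """Reduce-side join of two lists of (key, value) rows (illustrative only)."""
    # Map stage: tag every row with the table it came from.
    tagged = [(k, ("x", v)) for k, v in table_x]
    tagged += [(k, ("y", v)) for k, v in table_y]

    # Shuffle stage: sort by join key and group rows per key.
    # In a real cluster this sort/merge of intermediate files is the costly part.
    groups = defaultdict(lambda: {"x": [], "y": []})
    for key, (tag, value) in sorted(tagged, key=lambda kv: kv[0]):
        groups[key][tag].append(value)

    # Reduce stage: emit the cross product of matching rows for each key.
    return [
        (key, xv, yv)
        for key, g in groups.items()
        for xv in g["x"]
        for yv in g["y"]
    ]
```

Keys that appear in only one table produce no output, which matches inner-join semantics.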
  • #5 That is the motivation for the map join. When one of the tables is small enough to fit into memory, every mapper can hold that table in memory and do the join work there, so no shuffle or reduce stage is needed. That is how the previous map join works. However, the previous map join does not scale. When thousands of mappers read the small table from HDFS into memory, the small table easily becomes the performance bottleneck. Mappers also get read timeouts once the small table data becomes a hot spot. That is the problem with the previous map join.
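The core of a map join, as this note describes it, is a hash table built from the small table and probed once per big-table record. A minimal sketch, assuming rows are `(key, value)` tuples and using the hypothetical name `map_join`:

```python
def map_join(small_table, big_table_rows):
    """Join big-table rows against an in-memory hashtable of the small table."""
    # Build phase: the whole small table must fit in memory.
    hashtable = {}
    for key, value in small_table:
        hashtable.setdefault(key, []).append(value)

    # Probe phase: stream the big table; no shuffle or reduce is needed.
    for key, big_value in big_table_rows:
        for small_value in hashtable.get(key, ()):
            yield key, big_value, small_value
```

In the previous map join every mapper would run the build phase itself by reading the small table from HDFS, which is where the hot spot comes from.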
  • #6 I tried to solve this problem. We create a MapReduce local task, which runs locally, reads the small table data into memory, and serializes the hashtable into files. It then uploads the files to the Distributed Cache (DC). When the Map Join Task is launched, the DC propagates the hashtable files to each mapper's local file system, and each mapper loads them back into memory and does the join work as before. This way the small table data is read only once. We also use the DC to push the data to the mappers, instead of having each mapper pull the data from HDFS itself. The difference is that if multiple mappers run on the same machine, the DC only needs to push the data once.
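The build-once, load-everywhere flow in this note can be sketched as follows. Python's `pickle` stands in for Hive's hashtable serialization, and a local temporary directory stands in for the Distributed Cache; both are assumptions for illustration, as are all function names.

```python
import os
import pickle
import tempfile


def local_task_dump(small_table, path):
    """Local task: read the small table once and serialize its hashtable."""
    hashtable = {}
    for key, value in small_table:
        hashtable.setdefault(key, []).append(value)
    with open(path, "wb") as f:
        pickle.dump(hashtable, f)


def mapper_join(hashtable_path, big_split):
    """Mapper: load the pre-built hashtable from local disk, then probe it."""
    with open(hashtable_path, "rb") as f:
        hashtable = pickle.load(f)
    return [(k, bv, sv) for k, bv in big_split for sv in hashtable.get(k, ())]


def demo():
    """Two 'mappers' share one hashtable file, so the small table is read once."""
    with tempfile.TemporaryDirectory() as cache_dir:
        path = os.path.join(cache_dir, "small.hashtable")
        local_task_dump([(1, "s1"), (2, "s2")], path)
        split_a = mapper_join(path, [(1, "a")])
        split_b = mapper_join(path, [(2, "b"), (3, "c")])
        return split_a + split_b
```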
  • #7 Another optimization is to remove the JDBM component from Hive. JDBM is the persistent hash table used in Hive. Whenever the in-memory hashtable cannot hold any more data, it swaps key/value pairs out into the JDBM table, which acts as backup storage. By profiling, we found that JDBM is too heavyweight for map join: calls to JDBM's get function take more than 70% of the CPU time, and the generated hashtable file is too large to propagate through the DC. In fact, there is no need for a persistent hash table in a map join. If the table is too large to fit in memory, the query should not run as a map join at all. We have now removed this component from Hive completely.
  • #8 We ran several benchmarks to measure how much performance improved after the optimization. The results show that the new optimized map join is 12 to 26 times faster than the previous one. I have to mention that the improvement comes not only from using the Distributed Cache to propagate the hashtable file, but also from optimizing the map join code and removing a very heavyweight persistent hashtable component from Hive.
  • #9 Since the new map join performs very well, Hive should run a map join instead of a common join whenever possible. Previously, a user who wanted a map join had to give a hint in the query saying which table is the small table, so that the mappers could hold that data in memory. But not all users give this hint, and some give a wrong hint. So relying on a hint from the user is bad for both user experience and query performance. My work is to convert the common join into a map join automatically and dynamically at run time. Automatic means users no longer need to give a hint in the query. Dynamic means that at compile time Hive generates a series of execution paths, each covering one possible situation (the number of execution paths is bounded by the number of join tables). During execution, Hive chooses the most efficient execution path based on input file size.
  • #10 Let's take an example: two tables, x and y, joined on a join key. At compile time, Hive generates three execution paths. The first does a map join by assuming table x is the big table and loading the other table, y, into memory. The second also does a map join, but assumes table y is the big table. The third assumes both tables are too large to fit into memory, so a map join is not feasible and the original common join runs. Why do we do this? Because we don't know the data size of the join tables at compile time. Some of the join tables may be intermediate tables generated by a subquery at run time, so we only know the data size at execution time.
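The compile-time step in this note amounts to enumerating one map-join path per candidate big table plus one common-join fallback. A minimal sketch, where the dictionary layout and the name `plan_paths` are assumptions for illustration:

```python
def plan_paths(tables):
    """Enumerate candidate execution paths for a join over the given tables."""
    paths = []
    # One map-join path per table that might turn out to be the big one;
    # every other table would be loaded into memory.
    for big in tables:
        paths.append({
            "kind": "map_join",
            "big_table": big,
            "small_tables": [t for t in tables if t != big],
        })
    # Fallback path: no table is assumed to fit in memory.
    paths.append({"kind": "common_join", "big_table": None, "small_tables": []})
    return paths
```

For the two-table query on the slide, `plan_paths(["x", "y"])` yields the three paths described above.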
  • #11 Let's see what happens at execution time. When Task A has finished, we know the exact data size of each join table. If table X is the big table and the other table is small enough to fit into memory, Hive runs that map join execution path. If none of the tables can be loaded into memory, Hive runs the original common join execution path. In this way, Hive dynamically chooses the most efficient execution path at the execution stage.
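The run-time decision in this note can be sketched as a size check once the input sizes are known. The 25 MB default comes from the small-table bound mentioned in the notes for the Future Work slide; treating it as a limit on the combined size of all non-big tables, and the name `choose_path`, are assumptions for this example.

```python
SMALL_TABLE_LIMIT = 25 * 1024 * 1024  # 25 MB bound quoted in the notes


def choose_path(sizes, limit=SMALL_TABLE_LIMIT):
    """Pick an execution path from a dict of table name -> size in bytes."""
    # Treat the largest input as the big table that is streamed by mappers.
    big = max(sizes, key=sizes.get)
    # Assumption: all remaining tables together must fit under the limit.
    small_total = sum(size for name, size in sizes.items() if name != big)
    if small_total <= limit:
        return ("map_join", big)
    # No table is small enough: fall back to the common join path.
    return ("common_join", None)
```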
  • #12 Since the local task needs to load the data into memory, it is very memory intensive. Hive therefore launches the local task in a child JVM with the same heap size as a mapper. We already bound the input data size of the small tables carefully, but it is still possible for the local task to run out of memory. So the query processor monitors the memory usage of the local task closely. Once the memory usage of the local task exceeds a threshold, the local task aborts itself, which means the table is too large to fit into memory and the map join fails. In this case, the query processor switches back to the original Common Join task and runs it as a Backup Task. All of this is completely transparent to the user.
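The abort-and-fall-back behavior in this note can be sketched as below. A row count stands in for the real memory measurement, and a simple sorted nested-loop join stands in for the common join; both are simplifications, and every name here is hypothetical.

```python
class LocalTaskOutOfMemory(Exception):
    """Raised when the local task exceeds its (simulated) memory threshold."""


def build_hashtable(small_table, max_entries):
    """Build the hashtable, aborting once the entry count passes the threshold."""
    hashtable = {}
    count = 0
    for key, value in small_table:
        hashtable.setdefault(key, []).append(value)
        count += 1
        if count > max_entries:  # proxy for "memory usage above threshold"
            raise LocalTaskOutOfMemory(f"more than {max_entries} entries")
    return hashtable


def backup_common_join(small_table, big_table):
    """Stand-in for the common join: does not need the small table hashed."""
    return [
        (bk, bv, sv)
        for bk, bv in sorted(big_table)
        for sk, sv in sorted(small_table)
        if bk == sk
    ]


def run_join(small_table, big_table, max_entries):
    """Try the map join first; switch to the backup task if the build aborts."""
    try:
        hashtable = build_hashtable(small_table, max_entries)
    except LocalTaskOutOfMemory:
        return "common_join", backup_common_join(small_table, big_table)
    rows = [(k, bv, sv) for k, bv in big_table for sv in hashtable.get(k, ())]
    return "map_join", rows
```

The caller gets the same join result either way and only the first element of the returned tuple reveals which task actually ran.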
  • #13 Let's discuss the performance bottleneck. Previously, the small table file was the performance bottleneck. In the new map join, the Distributed Cache is the potential bottleneck. If you upload a large hashtable file into the DC, say larger than 30 MB, it really slows down the propagation speed of the DC. You will see that all the mappers are launched but stay in the initialization stage for a long time, which means they are waiting for the push from the DC. For now, the solution is a very straightforward workaround: we compress and archive all the hashtable files into a tar file.
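The workaround in this note, bundling the hashtable files into one compressed tar archive before distribution and unpacking it on the mapper side, can be sketched with Python's standard `tarfile` module. The use of gzip compression, the file names, and the function names are assumptions for illustration; the notes only say the files are compressed and archived into a tar file.

```python
import os
import tarfile
import tempfile


def archive_hashtables(paths, archive_path):
    """Compress and archive all hashtable files into a single tar.gz."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))
    return archive_path


def extract_hashtables(archive_path, dest_dir):
    """Mapper side: unpack the archive into the local working directory."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return [os.path.join(dest_dir, name) for name in names]


def round_trip_demo():
    """Archive two dummy hashtable files and read them back after extraction."""
    with tempfile.TemporaryDirectory() as work:
        src, dst = os.path.join(work, "src"), os.path.join(work, "dst")
        os.makedirs(src)
        os.makedirs(dst)
        paths = []
        for name, payload in [("t1.hashtable", b"a" * 1000), ("t2.hashtable", b"b" * 1000)]:
            path = os.path.join(src, name)
            with open(path, "wb") as f:
                f.write(payload)
            paths.append(path)
        archive = archive_hashtables(paths, os.path.join(work, "hashtables.tar.gz"))
        contents = {}
        for path in extract_hashtables(archive, dst):
            with open(path, "rb") as f:
                contents[os.path.basename(path)] = f.read()
        return contents
```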
  • #15 We ran several performance benchmarks to compare the map join with and without compression. The results show that compression improves performance by 21% to 86%. We can also see that the larger the input data is, the more performance we gain from compression, which is reasonable. The larger the small table is, the larger the hashtable file it generates. The larger the big table is, the more mappers are launched. Both factors contribute to making the DC the bottleneck, and compression helps in this situation.
  • #16 Finally, let's look at the performance comparison between the previous common join and the new optimized join, that is, the common join converted into a map join. All of the benchmark queries tested are eligible for conversion into a map join. The results show that join performance improves by 57% to 163% when the join can be converted into a map join.
  • #17 There are several pieces of future work. First, once this new optimized join is running in the cluster, it would be good to know how many join operations are converted into map joins and how many converted map joins fail because they run out of memory. We have already developed the hooks for this audit but still need time to deploy them in the cluster. Another item is to set the number of replicas of the compressed hashtable files based on the number of mappers. Currently, the replication factor for the hashtable files is 3. Setting it based on the number of mappers would improve propagation speed. We also manually bound the size of the small table to 25 MB, which may be a little conservative. If we can tune this parameter by sampling the data, we will get a more accurate limit for map join, and more queries can be converted into map joins. Finally, the local task can hold 2M unique key/value pairs in memory while consuming 1.47 GB of memory. By making it more memory efficient, the local task can hold more data in memory, so more join queries can be converted into map joins.
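As a rough back-of-the-envelope check on the last point, the figures quoted in the note (2M unique key/value pairs in 1.47 GB) imply a per-entry memory cost that can be computed directly. The sketch below assumes "G" means 2^30 bytes and "2M" means 2,000,000 entries; the helper names are hypothetical.

```python
GIB = 1024 ** 3


def bytes_per_entry(total_bytes, entries):
    """Average memory cost of one key/value pair in the hashtable."""
    return total_bytes / entries


def entries_that_fit(heap_bytes, per_entry_bytes):
    """How many entries fit in a given heap at a given per-entry cost."""
    return int(heap_bytes // per_entry_bytes)


# Figures from the note: 2M unique key/value pairs in 1.47 GB.
current_cost = bytes_per_entry(1.47 * GIB, 2_000_000)

# Hypothetical scenario: halving the per-entry cost in the same heap.
capacity_if_halved = entries_that_fit(1.47 * GIB, current_cost / 2)
```

Under these assumptions the current cost works out to several hundred bytes per entry, so halving it would roughly double the number of entries the local task can hold.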