A common join in hive will involve a Map stage and a Reduce stageAs we all know, shuffle stage before reducer is expensive, they need to sort and merge the intermediate file. So we tried to avoid this stage whenever is possible.
That’s the motivation of the map join.When one of the table is small enough to fit into the memory, so all the Mapper can hold the data in memory and do the join work in memory.So in this way, there is no shuffle/reduce stage is needed.That is how the previous map join works However the previous map join does not scaleThousands of Mapper read the small table from HDFS into memory, it will easily cause the small table to be the performance bottleneckAlso they will get read time out when the small table data became to be the hot spot. So that’s the problem of the previous map join
I tried to solve this problemWe create a map reduce local task, which will run locally and read the small table data into memory and serialize the hashtable into files.After that, it will upload the files into Distributed CacheWhen the Map Join Task is launched, the DC will propagate the hashtable files to each mapper’s local file system.And each mapper will load them back into the memory and do the join work as before.By doing this, we need to read the small table data only once. Also we use the DC to push the data to Mapper, instead of pulling data from HDFS by mapper itself. The difference is if multiple mappers runs on the same machine, DC only needs to push the data once.
Another optimization is to remove JDBM component from hive.JDBM is the persistent hash table used in Hive.Whenever the in-memory hashtable cannot hold data any more, it will swap the key/value into the JDBM table. It’s like a backup storageBy profiling, we found out JDBM is too heavy weight for map join.Take more than 70% CPU time when call the get function from JDBM.The generated Hashtable file is too large to propagate in DCActually, there is no need to use persistent hash table for map join. If the table is too large to fit in the memory, they should not run the query as a map join.Right now, we have totally removed this component from hive.
We run several benchmark to how much performance improvement after optimization.The result of benchmark shows the new optimized map join will be 12~26 times faster the previous one.I have to mention that the performance improvement is not only because of introducing the Distributed Cache to propagate the hashtable file,But also we have optimized the map join code and remove a very heavy weight persistent hashtable component from Hive.
since the new map join has a very good performance, Hive should try to run map join instead of common join whenever is possible.Previously, if user wants to do the map join, he needs to give the hints in query to assign which table is the small table 2) So the mapper can hold that data in the memory.3)But basically, not all of the users will givethis hint or users may gave a wrong hint in the query.4) So getting the hint from user is not good for user experience and query performance.6) So my work is to automatically and dynamically convert the common join into map join during the run time.7) Automatic means users don’t need to give the hint in the query any more.8) Dynamic means in the compile time, Hive will generate a series of execution flow, each of the execution flow covers one possible situation. (The number of execution path is bounded by the number of join tables.)12) During the execution, Hive will choose the most efficient execution path to run based on input file size.
Let’s take an example: There are 2 tables join together on a join key. Let’s say table x and table y.So during the compile time, it will generate 3 execution paths. First execution path will do the map join by assuming table x is the big table and loading the other tables into memory, which is the table y.The second will also the do the map join by assuming table y is the big table.And finally, 3rd execution path assume both of the table is too large to fit into memory, not feasible for map join, it will run the original common join.Why we are doing this? It’s because we don’t the data size of join table during compile time. Some of the join table may be the intermediate tables, which is generated from some sub query at run time.We only know the data size during execution time.
Let’s see what will happen during the execution time.When task A is finished, we exactly know the data size of each join table.If table X is the big table and the other table is small enough to fit into memory, Hive will run this execution path.However, if none of the tables can be load into memory, Hive will run the original common join execution path.By doing this, Hive can dynamically choose the most efficient execution path to run at execution stage.
Since the local task needs to load the data into memory, it is a very memory intensive task.So hivewill launch this local task in a child jvm, which has the same heap size as the Mapper's.Right now, We have already carefully bounded the input data size of the small tables, but there is still possible that the local task may run out of memory.Sothe query processor will measure the memory usage of the local task very carefully. Once the memory usage of the Local Task is higher than a threshold, this Local Task will abort itself, which means the table is too large fit into memory and map join fails.In this case, the query processor will switch back the original Common Join task as a Backup Task to run, All of them is totally transparent to user.
Let’s discuss the performance bottleneck. Previously, the small table file is the performance bottleneck,However, in the new map join, the distributed cache is the potential performance bottleneckIf you upload large hashtable file into DC, say larger than 30 M, it will really slow down the propagation speed of DC.You will see all the Mappers are launched but they will be in the initialization stage for a long time, which means they are waiting for the push from DC.Right now, the solution is very straightforward and walk around , we compress and archive all the hashtable files into tar file.
We have run several performance benchmark to compare between with the compression and without compression.From the result, we can see the compression can help to improve the performance by 21 % ~ 86 %.Also we can see the larger the input data size is, the more performance we can get by compression, which is very reasonable.The larger the small table is, the large hashtable file it will generated.The larger the big table is, the more mappers will be launched.Both of these 2 factors will contributes to making the DC to be bottleneck problem.And compression can help to solve this situation.
1) Finally,let’s see the performance comparison between previous common join with the new optimized common join, which is converted map join.2) Because all the tested benchmark is valid to convert into map join. 3) The result shows the join performance will be improved by 57% to 163%, if the join can be converted into map join.
There are several future works to follow. First, when this new optimized Join is running in the cluster, it would be better to know how many join operation is converted in Map Join and how many converted Map Join fails because it runs out of memory.We have already developed the hooks to audit but still need time to deploy in the cluster.Another thing is to set up the number of replications for the compressed hashtable files based on the number of Mappers. Currently, the number of replications for thehashtable files is 3. If we can set the replication number based on number of mappers, it will improve the propagation speed.we manually bounded the table size of small table to be 25M, which may be a little conservative. If we can tune this parameters by sampling the data, we will get more accurate limit of map join and more queries can be convert into map join.Finally,the local task can hold 2M unique key/value in the memory by consuming 1.47G memory space.By optimization to be more memory efficient, the local task can hold more data in memory. So more join queries can be converted into Map Join.
Join optimization in hive
Join Optimization in Hive<br />Liyin Tang<br />
Outline<br />Map Join Optimization<br />Previous Common Join and Map Join<br />Optimized Map Join<br />JDBM<br />Performance Evaluation<br />Convert Join to Map Join Automatically<br />How it works<br />Performance Evaluation<br />
Compile Time<br />SELECT * FROM <br />SRC1 x JOIN SRC2 y <br />ON x.key = y.key;<br />Task A<br />a<br />Conditional Task<br />Assume TABLE x is the big table<br />Assume TABLE y is the big table<br />MapJoinLocalTask<br />MapJoinLocalTask<br />CommonJoinTask<br />MapJoinTask<br />MapJoinTask<br />Task C<br />
Execution Time<br />SELECT * FROM <br />SRC1 x JOIN SRC2 y <br />ON x.key = y.key;<br />Task A<br />Both tables are too big for map join<br />Table X is the big table<br />a<br />Conditional Task<br />MapJoinLocalTask<br />CommonJoinTask<br />MapJoinTask<br />Task C<br />
Backup Task<br />Task A<br />Conditional Task<br />Memory Bound<br />MapJoinLocalTask<br />Run as a Backup Task<br />CommonJoinTask<br />MapJoinTask<br />Task C<br />
Performance Bottleneck<br />Distributed Cache is the potential performance bottleneck<br />Large hashtable file will slow down the propagation of Distributed Cache<br />Mappers are waiting for the hashtables file from Distributed Cache<br />Compress and archive all the hashtable file into a tar file.<br />
Future Work<br />Audit how many join will be converted into map join in the cluster.<br />Set hashtable file replica number based on the number of Mappers<br />Tune the limit of small table data size by sampling<br />Increase the in-memory hashtable capacity.<br />