For anyone who has been working on MapReduce, there is this age-old problem around “how do I figure out the correct number of reducers?”. We guess some number at compile-time and usually that turns out to be incorrect at run-time. Let’s see how we can use the Tez model to fix that. So here is this Map Vertex and this Reduce Vertex, which have these tasks running and you have the Vertex Manager running inside the framework …
[CLICK] The Map Tasks can send Data Size Statistics to the Vertex Manager, which can then extrapolate those statistics to figure out “what will the final size of the data be when all of these Maps finish?”. Based on that, it can realize that the data size is actually smaller than expected, and that it can run two reduce tasks instead of three.
[CLICK] The Vertex Manager sends a Set Parallelism command to the framework, which changes the routing information in-between these two tasks and also cancels the last task.
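The heuristic described above can be sketched in a few lines. This is a simplified, hypothetical version of the decision logic (the real implementation lives in Tez's ShuffleVertexManager; the function name and parameters here are illustrative, not the actual API):

```python
def decide_reduce_parallelism(completed_output_bytes, completed_maps,
                              total_maps, planned_reducers,
                              bytes_per_reducer):
    """Extrapolate the total shuffle size from the maps that have
    finished so far, then pick a reducer count. Parallelism is only
    ever reduced, never raised above the compile-time plan."""
    if completed_maps == 0:
        # No statistics yet: stick with the compile-time guess.
        return planned_reducers
    projected_total = completed_output_bytes * total_maps / completed_maps
    # Ceiling division: how many reducers the projected data needs.
    needed = max(1, -(-int(projected_total) // bytes_per_reducer))
    return min(planned_reducers, needed)

# Example matching the slide: 50 of 100 maps are done and have produced
# 10 GiB, so ~20 GiB total is projected. At 10 GiB per reducer, two
# reducers suffice instead of the planned three.
print(decide_reduce_parallelism(
    completed_output_bytes=10 * 2**30,
    completed_maps=50,
    total_maps=100,
    planned_reducers=3,
    bytes_per_reducer=10 * 2**30))  # -> 2
```

Once the count drops from three to two, the framework re-routes the map outputs to the surviving reducers and cancels the third task, as described in the slide.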
query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
1.5x to 3x speedup on some of the PigMix queries.
Hive has written its own processor
Apache Tez - A New Chapter in Hadoop Data Processing
Tez – Pig performance gains
• Demonstrates performance gains from a basic translation to a Tez DAG
Script           MR jobs    MapReduce vs Tez
Prod script 1       5          25m vs 10m
Prod script 2       5          34m vs 16m
Prod script 3      12       1h 46m vs 48m
Prod script 4      15       2h 22m vs 1h
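The production-script numbers above work out to roughly 2x or better across the board. A quick check of the arithmetic, assuming the first time in each pair is the MapReduce run and the second is the Tez run:

```python
# Run times in minutes: (MapReduce, Tez), assuming that pairing
# from the slide title "Tez - Pig performance gains".
mr_vs_tez = {
    "Prod script 1": (25, 10),
    "Prod script 2": (34, 16),
    "Prod script 3": (106, 48),  # 1h 46m vs 48m
    "Prod script 4": (142, 60),  # 2h 22m vs 1h
}

for name, (mr, tez) in mr_vs_tez.items():
    print(f"{name}: {mr / tez:.1f}x faster on Tez")
# -> speedups of roughly 2.5x, 2.1x, 2.2x, and 2.4x
```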