



Find out how to improve the job run time by 25% using Apache Spark RDD Join. We have used the technique discussed in this presentation to reduce the execution time from 4 seconds to 3 seconds.

Experience the power of Apache Spark with Imaginea
1) Imaginea is among the top contributors to Spark code
2) Building products on Spark since 2014
3) Open-source contributors to Apache Hadoop and Zeppelin

Published in: Data & Analytics


  1. 1. Private and confidential. Copyright (C) 2016, Imaginea Technologies Inc. All rights reserved. APACHE SPARK™ RDD JOIN TO REDUCE JOB RUN TIME: Insights from Imaginea
  2. 2. THE PROBLEM: Over 1 TB of data is received from S3 as rows of tuples [ID, v1, v2, …, vn]. We needed to aggregate the values by ID and store the result in HDFS. Because we joined the existing data with the incremental data on Spark, a huge amount of data was shuffled across the cluster. We wanted to reduce this shuffle and thereby reduce the job run time.
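
The aggregation described on this slide can be sketched in plain Scala, without Spark. This is only an illustration: the slide does not say which aggregate was used, so an element-wise sum is assumed, and the names `AggregateByIdSketch`, `combine`, `aggregate` and `merge` are hypothetical.

```scala
object AggregateByIdSketch {
  // One input row: (ID, values v1..vn).
  type Row = (String, Seq[Double])

  // Assumed aggregate: element-wise sum of two value vectors of equal length.
  def combine(a: Seq[Double], b: Seq[Double]): Seq[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // Fold all rows of a batch into one aggregated vector per ID.
  def aggregate(rows: Seq[Row]): Map[String, Seq[Double]] =
    rows.groupBy(_._1).map { case (id, group) =>
      id -> group.map(_._2).reduce(combine)
    }

  // Join the stored aggregates with the aggregates of an incremental batch.
  def merge(existing: Map[String, Seq[Double]], incremental: Seq[Row]): Map[String, Seq[Double]] = {
    val fresh = aggregate(incremental)
    (existing.keySet ++ fresh.keySet).map { id =>
      // At least one side is present, because id came from one of the key sets.
      val sides = existing.get(id).toList ++ fresh.get(id).toList
      id -> sides.reduce(combine)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val existing = Map("a" -> Seq(1.0, 2.0), "b" -> Seq(5.0, 5.0))
    val batch = Seq(("a", Seq(10.0, 20.0)), ("c", Seq(7.0, 8.0)), ("a", Seq(1.0, 1.0)))
    println(merge(existing, batch))
  }
}
```

On a cluster, the `merge` step is the join whose shuffle the rest of the deck tries to avoid.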
  3. 3. THE APPROACH: SAME ID GOES TO SAME PARTITION. Each ID in one data set has to be matched with the same ID in the other data set, so we partitioned both data sets such that rows with the same ID go to the same partition and therefore to the same Spark worker. With this approach, rows are joined locally and the costly shuffle over the network is avoided.
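
Why a partition-local join is enough can be shown with a small Spark-free sketch. It assumes both data sets are split with the same hash rule and the same number of partitions; all names here (`LocalJoinSketch`, `partitionFor`, `localJoin`) are hypothetical.

```scala
object LocalJoinSketch {
  type Rows[V] = Seq[(String, V)]

  // Non-negative modulo of the key's hash: the same key always gets the same index.
  def partitionFor(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Split rows into numPartitions buckets by key hash.
  def partition[V](rows: Rows[V], numPartitions: Int): Vector[Rows[V]] = {
    val byIndex = rows.groupBy { case (key, _) => partitionFor(key, numPartitions) }
    Vector.tabulate(numPartitions)(i => byIndex.getOrElse(i, Seq.empty[(String, V)]))
  }

  // Plain inner join over two complete row sets.
  def join[A, B](left: Rows[A], right: Rows[B]): Seq[(String, (A, B))] =
    left.flatMap { case (key, a) =>
      right.collect { case (k, b) if k == key => (key, (a, b)) }
    }

  // Join partition i of the left side only with partition i of the right side.
  def localJoin[A, B](left: Rows[A], right: Rows[B], numPartitions: Int): Seq[(String, (A, B))] =
    partition(left, numPartitions)
      .zip(partition(right, numPartitions))
      .flatMap { case (l, r) => join(l, r) }

  def main(args: Array[String]): Unit = {
    val existing = Seq("a" -> 1, "b" -> 2, "c" -> 3)
    val incremental = Seq("a" -> "x", "c" -> "y", "d" -> "z")
    // Both joins yield the same pairs; only the local one avoids cross-partition lookups.
    println(join(existing, incremental).toSet == localJoin(existing, incremental, 4).toSet)
  }
}
```

The equality only holds because both sides use the same partitioning rule; if the two sides were split differently, `localJoin` would silently miss matches.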
  4. 4. HASHPARTITIONER ON THE RDDs: The approach is to use a HashPartitioner on the RDDs, which partitions data based on the key hash. THE HURDLE: 1. Even though the HashPartitioner divides data based on keys, it does not enforce node affinity. 2. Thus, the amount of data shuffled did not decrease.
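
The hurdle can be pictured as two independent mappings: the partitioner fixes the partition index of a key, but which node runs a given partition is decided separately. A minimal sketch of that second mapping, with hypothetical node names and a hypothetical `remotePairs` helper (this models the idea only, not Spark's scheduler):

```scala
object PlacementSketch {
  // A placement maps a partition index to the node that happens to run it.
  type Placement = Map[Int, String]

  // Partition indices whose two halves sit on different nodes,
  // so one half has to travel over the network before the join.
  def remotePairs(left: Placement, right: Placement): Seq[Int] =
    left.keys.toSeq.sorted.filter(i => right.get(i).exists(_ != left(i)))

  def main(args: Array[String]): Unit = {
    // Same partition indices on both sides, but unrelated node choices.
    val existing = Map(0 -> "node-a", 1 -> "node-b", 2 -> "node-c")
    val incremental = Map(0 -> "node-b", 1 -> "node-b", 2 -> "node-a")
    println(remotePairs(existing, incremental))
  }
}
```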
  5. 5. So, we dug into the Spark code to figure out a way to reduce the data shuffle.
  6. 6. OUR SOLUTION
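  7. 7. STEP 1: OVERRIDE THE TASKSCHEDULER PROCESS TO ENSURE DATA ENTERS A SINGLE NODE. The TaskScheduler assigns worker nodes to partitions. Override this process in a new wrapper RDD so that the data of a given partition is always directed to the same node (Spark treats preferred locations as a scheduling preference). This can be implemented as follows (delegate everything to the underlying RDD but re-implement getPreferredLocations):

        import scala.reflect.ClassTag
        import org.apache.spark.{Partition, TaskContext}
        import org.apache.spark.rdd.RDD

        class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {
          // Worker addresses; left blank in the source.
          val nodeIPs = Array("", "", "")
          // Delegate partitioning and computation to the wrapped RDD.
          override protected def getPartitions: Array[Partition] = prev.partitions
          override def compute(split: Partition, context: TaskContext): Iterator[U] =
            prev.iterator(split, context)
          // Pin each partition index to one node, round-robin.
          override def getPreferredLocations(split: Partition): Seq[String] =
            Seq(nodeIPs(split.index % nodeIPs.length))
        }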
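
The round-robin rule inside getPreferredLocations can be tried out without a cluster. In this sketch the host names are hypothetical, since the slide leaves the real addresses blank, and `preferredNode` and `placement` are illustrative helpers rather than Spark API.

```scala
object NodeAffinitySketch {
  // Hypothetical host names standing in for the blank entries on the slide.
  val nodes: Vector[String] = Vector("node-a", "node-b", "node-c")

  // Same rule as the slide: pick a node round-robin by partition index.
  def preferredNode(partitionIndex: Int): String =
    nodes(partitionIndex % nodes.length)

  // Preferred node for every partition of a data set with numPartitions partitions.
  def placement(numPartitions: Int): Map[Int, String] =
    (0 until numPartitions).map(i => i -> preferredNode(i)).toMap

  def main(args: Array[String]): Unit = {
    // Two data sets with the same partition count get identical placements,
    // so partition i of one side prefers the same node as partition i of the other.
    val existing = placement(6)
    val incremental = placement(6)
    println(existing == incremental)
    println(existing.toSeq.sortBy(_._1))
  }
}
```

Because the rule depends only on the partition index, it gives both RDDs the same preference as long as they have the same number of partitions.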
  8. 8. STEP 2: WRITE SPARK CODE TO RUN A JOB AND EDIT IT TO EXECUTE A JOIN. 1. Take trial datasets. 2. Move them to HDFS. 3. Run a very simple job that does a couple of transformations and finally calls dsRdd.join(devRDD):

        // HDFS paths are elided in the source.
        val r1 = sc.textFile("hdfs://…")
        val r2 = sc.textFile("hdfs://…")
        // The receivers r1/r2 and the parameter name "line" are assumed; the placeholders are from the source.
        val dsRdd = r1.map(line => <some transformation>).map(tokens => <some more>)
        val devRDD = r2.map(line => <some transformation>).map(tokens => <some more>)
        // finally join and materialize
        dsRdd.join(devRDD, dummy).count
  9. 9. UNDERSTANDING THE JOB RUN
  10. 10. WHY A THREE-STAGE PROCESS? After the job ran, the Spark UI showed 3 stages. Stages 0 and 1 produce shuffle data, which is consumed as a whole in stage 2.
  11. 11. DAG FOR STAGE 0: It starts by reading the “random1” file and then applies the two map functions to that RDD. The same is done with the file “random2” in stage 1, which has the same DAG.
  12. 12. DAG FOR STAGE 2: There is only one block, corresponding to the “join” method call on dsRdd, that is, dsRdd.join(devRDD, dummy). Here we observe shuffle boundaries in the job when, ideally, there is no reason for them. So we look at CoGroupedRDD to see what causes the shuffle boundary between the join and the previous stages.
  13. 13. ANALYSING COGROUPEDRDD: If an RDD being joined does not have exactly the same partitioner as the one for this RDD, it is marked as a ShuffleDependency (in the else block). Clearly, dsRdd and devRDD went through this path, and both were marked as separate stages coming into “join”. Hence we get three stages.

        override def getDependencies: Seq[Dependency[_]] = {
          rdds.map { rdd: RDD[_] =>
            if (rdd.partitioner == Some(part)) {
              logDebug("Adding one-to-one dependency with " + rdd)
              new OneToOneDependency(rdd)
            } else {
              logDebug("Adding shuffle dependency with " + rdd)
              new ShuffleDependency[K, Any, CoGroupCombiner](
                rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
            }
          }
        }
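
The effect of that if/else on the number of stages can be modelled in a few lines of plain Scala. This is a deliberately simplified model, not Spark's scheduler: `Part`, `Parent` and `stageCount` are hypothetical stand-ins, and "one extra stage per shuffle dependency" is assumed to hold only for the simple two-input join discussed here.

```scala
object CoGroupDependencySketch {
  final case class Part(numPartitions: Int)                        // stand-in for a Partitioner
  final case class Parent(name: String, partitioner: Option[Part]) // stand-in for a parent RDD

  sealed trait Dep
  case object OneToOne extends Dep
  case object Shuffle extends Dep

  // Same decision as the if/else in getDependencies: a parent whose partitioner
  // equals the join's partitioner is read in place, any other parent is shuffled.
  def dependencies(parents: Seq[Parent], part: Part): Seq[Dep] =
    parents.map(p => if (p.partitioner == Some(part)) OneToOne else Shuffle)

  // For this two-input join: one stage for the join plus one per shuffled parent.
  def stageCount(deps: Seq[Dep]): Int = 1 + deps.count(_ == Shuffle)

  def main(args: Array[String]): Unit = {
    val dummy = Part(4)
    val plain = Seq(Parent("dsRdd", None), Parent("devRDD", None))
    val declared = plain.map(_.copy(partitioner = Some(dummy)))
    println(stageCount(dependencies(plain, dummy)))    // parents without a matching partitioner
    println(stageCount(dependencies(declared, dummy))) // parents that report the join's partitioner
  }
}
```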
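  14. 14. WRAP TWO RDDs WITHOUT REPARTITIONING: The two RDDs can be wrapped without repartitioning. Delegate everything to the underlying RDDs and plug in the dummy partitioner. This makes CoGroupedRDD report that there are no stage boundaries, so the DAG Scheduler schedules everything locally on each worker. The join result is only correct if rows with the same key already sit in the same partition index of both RDDs. Here’s the (very small) code for WrapRDD:

        import scala.reflect.ClassTag
        import org.apache.spark.{Partition, Partitioner, TaskContext}
        import org.apache.spark.annotation.DeveloperApi
        import org.apache.spark.rdd.RDD

        class WrapRDD[T: ClassTag](rdd: RDD[T], part: Partitioner)
          extends RDD[T](rdd.sparkContext, rdd.dependencies) {

          // Computation and partitions are taken unchanged from the wrapped RDD.
          @DeveloperApi
          override def compute(split: Partition, context: TaskContext): Iterator[T] =
            rdd.compute(split, context)

          override protected def getPartitions: Array[Partition] = rdd.partitions

          // ********* main thing/hack ******* ///
          // Report the given partitioner without moving any data.
          override val partitioner: Option[Partitioner] = Some(part)
        }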
  15. 15. THE RESULT
  16. 16. ONE-STAGE JOB RUN: After the changes, there is only one stage. All the data is read (51 × 2 = 102 MB) and nothing is written to or read from shuffle.
  17. 17. EXECUTION TIME IMPROVED BY 25%, FROM 4 SECONDS TO 3 SECONDS: The two RDDs that were computed in separate stages (0 and 1) earlier are now part of the same stage because they were wrapped in WrapRDD. The execution time also differs: in the first case it was 4 seconds and now it is 3 seconds, a good 25% improvement. This is because the process no longer has to go to disk to write the shuffle and then read it back immediately. Instead, it works on the intermediate data right away.
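
As a quick check of the headline figure, using only the two timings quoted on this slide (the symbols are introduced here for illustration):

```latex
\[
\frac{t_{\text{before}} - t_{\text{after}}}{t_{\text{before}}}
  = \frac{4\,\text{s} - 3\,\text{s}}{4\,\text{s}} = 0.25 = 25\%,
\qquad
\frac{t_{\text{before}}}{t_{\text{after}}} = \frac{4}{3} \approx 1.33
\]
```

The second ratio expresses the same result as a speedup factor of roughly 1.33.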
  18. 18. IN SUMMARY: We found a good workaround for a process in which a large amount of data needs to be analyzed and segregated. This in turn improves the quality of work and the delivery time when using Spark RDDs. Execution time improved by 25%, from 4 seconds to 3 seconds.
  19. 19. EXPERIENCE THE POWER OF APACHE SPARK WITH IMAGINEA: 1) Imaginea is among the top contributors to Spark code 2) Building products on Spark since 2014 3) Open-source contributors to Apache Hadoop and Zeppelin. To find out more, visit
  20. 20. ABOUT THE AUTHOR: SACHIN TYAGI, Head – Data Engineering, Imaginea. Sachin heads the Data Engineering & Analytics practice at Imaginea. With over 10 years of IT experience, he brings both Data Science and Data Engineering expertise to solving complex problems in Big Data and Machine Learning. At Imaginea, Sachin has been pivotal in implementing Apache Spark solutions for several FAST 500 companies in areas such as Predictive Recommendation, Anomaly Detection and Contextual Search.
  21. 21. Disclaimer: This document may contain forward-looking statements concerning products and strategies. These statements are based on management's current expectations and actual results may differ materially from those projected, as a result of certain risks, uncertainties and assumptions, including but not limited to: the growth of the markets addressed by our products and our customers' products, the demand for and market acceptance of our products; our ability to successfully compete in the markets in which we do business; our ability to successfully address the cost structure of our offerings; the ability to develop and implement new technologies and to obtain protection for the related intellectual property; and our ability to realize financial and strategic benefits of past and future transactions. These forward-looking statements are made only as of the date indicated, and the company disclaims any obligation to update or revise the information contained in any forward-looking statements, whether as a result of new information, future events or otherwise. All Trademarks and other registered marks belong to their respective owners. Copyright © 2012-2015, Imaginea Technologies, Inc. and/or its affiliates. All rights reserved. Credits: Images under Creative Commons Zero license.