
Programming in Spark - Lessons Learned in OpenAire project



Programming in Spark - typical problems and solutions (programming language, coding practices, code serialization and caching - kryo, storage format - avro, piping to external programs, automated testing).

Published in: Data & Analytics

  1. Programming in Spark: Lessons Learned in OpenAire project
     Łukasz Dumiszewski, ICM, University of Warsaw, 10.2016
     Duration: 1h. Requirements: knowledge of Apache Spark
  2. Goals of work in OpenAire (IIS)
     • Rewriting of OpenAire (IIS) from MR/Pig to Spark (several modules, e.g. citation-matching)
     • Improvement of project structure
     • Enhancement of integration tests
     • Creation of new modules: matching of publication affiliations, IIS execution report, etc.
  3. Problems and solutions
     • Programming language
     • Coding standards
     • Data serialization and cache
     • Code serialization
     • Data storage format
     • Accumulators
     • Piping to external programs
     • Testing
  4. Programming language
     Java 8: no problems encountered, friendly Java Spark API, readable code.
  5. Coding standards
     Standard programming practices (low coupling, high cohesion). Possible use of Spring for dependency injection.
     Pros: code readability and reliability, easy development and testing.
     See: AffMatchingService, AffMatchingJob
  6. Data serialization and cache
     Kryo serialization is fast and efficient:
         conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
     Problem with deserialization of Avro collections (Avro-specific list implementation). Solved by implementing a custom Kryo registrator:
         conf.set("spark.kryo.registrator", "$AvroCompatibleKryoRegistrator");
     The registrator can be found at:
  7. Code serialization
     Big misunderstanding in books and on the Internet (?). "Learning Spark", in the paragraph describing the change of spark.serializer to Kryo, says:
         Whether using Kryo or Java’s serializer, you may encounter a NotSerializableException if your code refers to a class that does not extend Java’s Serializable interface …
     The setting spark.serializer does not refer to code serialization but to data serialization. Code serialization is controlled by spark.closure.serializer, which uses Java serialization by default (changing it is not recommended, because only a small amount of data is serialized and sent in this case). For this reason, classes referenced from the code have to implement Serializable (or Externalizable); otherwise we get a NotSerializableException.
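The point of slide 7 can be reproduced without Spark. The sketch below is plain Java, not Spark code: it writes a lambda with Java serialization, the mechanism the slide names as the default for closures. All class and method names here (ClosureSerializationDemo, PlainService, SerializableService, SerializableFunction) are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class ClosureSerializationDemo {

    /** Hypothetical stand-in for a serializable function interface. */
    public interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    /** Hypothetical service that does NOT implement Serializable. */
    public static class PlainService {
        String normalize(String s) { return s.trim().toLowerCase(); }
    }

    /** The same service, marked Serializable. */
    public static class SerializableService implements Serializable {
        private static final long serialVersionUID = 1L;
        String normalize(String s) { return s.trim().toLowerCase(); }
    }

    /** Returns true if Java serialization can write the object, false on NotSerializableException. */
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The lambda captures a non-serializable object, so writing the lambda fails. */
    public static boolean closureOverPlainServiceSerializes() {
        PlainService service = new PlainService();
        SerializableFunction<String, String> f = s -> service.normalize(s);
        return canSerialize(f);
    }

    /** The lambda captures a serializable object, so writing the lambda succeeds. */
    public static boolean closureOverSerializableServiceSerializes() {
        SerializableService service = new SerializableService();
        SerializableFunction<String, String> f = s -> service.normalize(s);
        return canSerialize(f);
    }

    public static void main(String[] args) {
        System.out.println(closureOverPlainServiceSerializes());        // false
        System.out.println(closureOverSerializableServiceSerializes()); // true
    }
}
```

The failure comes from the object captured by the lambda, not from the data being processed, which is why switching the data serializer does not make it go away.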
  8. Code serialization or mapPartitions?
     It does not make sense to write functions that operate on partitions (like mapPartitions) and to create service beans inside these partitions only to avoid serializing the code and sending it between nodes. Serializing and copying the code does not have a big influence on the efficiency of an application. Using mapPartitions complicates the code and makes it difficult to write unit tests.
         void execute() {
             rdd.mapPartitions(iter -> {
                 SomeService service = new SomeService();
                 service.generate…
                 ...
                 return someCollection;
             });
         }
  9. Data storage format
     To read an Avro file you can use the standard Hadoop API:
         JavaPairRDD<AvroKey<T>, NullWritable> inputRecords =
             (JavaPairRDD<AvroKey<T>, NullWritable>) sc.newAPIHadoopFile(
                 avroDatastorePath, AvroKeyInputFormat.class, avroRecordClass,
                 NullWritable.class, job.getConfiguration());
  10. Data storage format
      Problem: when using the standard Hadoop API in Spark, you can come across unpredictable errors, because the Hadoop record reader reuses the same Writable object for all records read. This is not a problem in MapReduce jobs, where each record is processed separately. In Spark, however, it can sometimes lead to undesired effects. For example, when an RDD is cached, only the last object read will be cached (as many times as there were records read). This is probably because multiple references to the same object are created.
      To eliminate this phenomenon, one should clone each Avro record after it has been read.
      See: spark-utils/SparkAvroLoader
          JavaRDD<DocumentToProject> docProjects = avroLoader.loadJavaRDD(sc, inputPath, DocumentToProject.class);
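The effect described on slide 10 can be illustrated without Hadoop or Spark. The sketch below is a plain-Java simulation under stated assumptions: Record is a hypothetical mutable record type, reusingReader is a hypothetical reader that hands out the same instance for every record (as the slide says the Hadoop record reader does), and cache stands in for keeping references to the records that were read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReusedRecordDemo {

    /** Hypothetical mutable record, standing in for a reused Writable/Avro object. */
    static final class Record {
        String value;
        Record(String value) { this.value = value; }
        Record copy() { return new Record(value); }
    }

    /** A reader that overwrites and returns one shared instance for every record. */
    static Iterator<Record> reusingReader(List<String> values) {
        final Record shared = new Record(null);
        final Iterator<String> source = values.iterator();
        return new Iterator<Record>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public Record next() {
                shared.value = source.next();
                return shared;
            }
        };
    }

    /** Keeps references to all records read, optionally copying each one first. */
    public static List<String> cache(List<String> values, boolean copyEachRecord) {
        Iterator<Record> reader = reusingReader(values);
        List<Record> cached = new ArrayList<>();
        while (reader.hasNext()) {
            Record r = reader.next();
            cached.add(copyEachRecord ? r.copy() : r);
        }
        List<String> result = new ArrayList<>();
        for (Record r : cached) {
            result.add(r.value);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        System.out.println(cache(input, false)); // [c, c, c]
        System.out.println(cache(input, true));  // [a, b, c]
    }
}
```

Without the copy, every cached reference points at the single shared instance, so only the last value survives; copying each record right after it is read restores the expected contents.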
  11. Usage of accumulators
      At first the IIS execution report was based on accumulators; later it used plain RDD counts. Use accumulators wisely (if you have to), and only in actions.
      Main disadvantages of accumulators:
      • They allow one to store data in custom structures. Naive usage can lead to memory problems (accumulators on every node, sending to the driver).
      • When used in transformations (map, filter etc.), repeated tasks (in case of a node failure or a memory shortage) can lead to incorrect accumulator values: they are calculated and increased as many times as a given transformation has been repeated.
      More:
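The second disadvantage on slide 11 can be shown with a small simulation. This is plain Java, not Spark: an AtomicLong stands in for an accumulator, runTask is a hypothetical task that updates it once per record inside a transformation, and the attempts parameter models the same task being executed again after a failure.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class RetriedTaskAccumulatorDemo {

    /** A "task" over one partition that bumps the counter once per record it processes. */
    static int runTask(List<Integer> partition, AtomicLong accumulator) {
        int sum = 0;
        for (int x : partition) {
            accumulator.incrementAndGet(); // side effect inside the transformation
            sum += x;
        }
        return sum;
    }

    /** Executes the same task the given number of times and returns the final counter value. */
    public static long accumulatorAfter(List<Integer> partition, int attempts) {
        AtomicLong accumulator = new AtomicLong();
        for (int i = 0; i < attempts; i++) {
            runTask(partition, accumulator); // the task result is the same each time
        }
        return accumulator.get();
    }

    public static void main(String[] args) {
        List<Integer> partition = Arrays.asList(1, 2, 3, 4);
        System.out.println(accumulatorAfter(partition, 1)); // 4
        System.out.println(accumulatorAfter(partition, 2)); // 8
    }
}
```

The task result is identical on every attempt, but the counter grows with each repetition, which is the incorrect value the slide warns about.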
  12. Piping to external programs
      To use an external script in a Spark job, one must upload it to every node:
          SparkContext.addFile(path)
      To refer to it, one should use (as advised in the documentation comment of addFile):
          SparkFiles.get(fileName)
      The catch: it only works in local mode! In cluster mode the path is different: the script files are in the working directory of each node.
      Experience: many non-repeatable errors (everything was fine when a node was on the same server as the driver).
      For the solution see: DocumentClassificationJob
  13. Unit tests
      • Write unit tests as for any other Java code. It is not difficult if the code is written properly (just as in the case of non-distributed computing).
      • Mocking JavaRDD is not a problem.
      • Testing functions (lambda expressions) is a bit tedious.
      See: AffMatchingServiceTest
  14. Testing Spark jobs
      Helpful classes that facilitate the testing of Spark jobs: spark-utils/test
  15. Spark as an action in an Oozie workflow
      Only one jar can be passed to ‘spark submit’ in an Oozie action. Use the Maven Shade plugin or a similar tool to merge many jars into one.
  16. Integration tests of Oozie workflows
      While working on IIS, one can run the Oozie workflow integration tests from an IDE (Eclipse, NetBeans). Dedicated code creates the Oozie packages, sends them to a server, polls for the job status and compares the results with those expected.
      See:
  17. Conclusions
      • It is easier to write and to test a Spark job than an equivalent chain of M-R jobs.
      • Efficiency: after it was rewritten from MR to Spark, the CitationMatching module execution time fell from 28h to 10h (the comparison is far from perfect because each version was run on a different cluster).
      • Debugging is difficult.
      • Easy integration with Oozie. Oozie workflows are less complex; a lot of logic has been moved to Spark.