Programming in Spark
Lessons Learned in OpenAire project
Łukasz Dumiszewski, ICM, University of Warsaw, 10.2016
Duration: 1h, Requirements: knowledge of Apache Spark
Goals of work in OpenAire (IIS)
• Rewriting OpenAire (IIS) from MR/Pig to Spark (several modules, e.g. citation matching)
• Improving the project structure
• Enhancing the integration tests
• Creating new modules: matching of publication affiliations, IIS execution report, etc.
Problems and solutions
• Programming language
• Coding standards
• Data serialization and cache
• Code serialization
• Data storage format
• Accumulators
• Piping to external programs
• Testing
Programming language
Java 8: no problems encountered; the Java Spark API is friendly and the code is readable.
Coding standards
Standard programming practices (low coupling, high cohesion). Spring can be used for
dependency injection.
Pros: code readability and reliability, easy development and testing.
See:
AffMatchingService
AffMatchingJob
Data serialization and cache
Kryo serialization: fast and efficient
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
Problem with deserialization of Avro collections (Avro-specific list implementation).
Solved by implementing a custom Kryo registrator:
conf.set("spark.kryo.registrator",
    "eu.dnetlib.iis.wf.affmatching.AffMatchingJob$AvroCompatibleKryoRegistrator");
The registrator can be found at: https://github.com/CeON/spark-utils/
Code serialization
A common misunderstanding in books and on the Internet (?)
"Learning Spark", in the paragraph describing the change of spark.serializer to Kryo:
"Whether using Kryo or Java's serializer, you may encounter a NotSerializableException if
your code refers to a class that does not extend Java's Serializable interface …"
The spark.serializer setting does not control code serialization; it controls data
serialization.
Code serialization is controlled by spark.closure.serializer, which uses Java
serialization by default (changing it is not recommended, because only a small amount of
data is serialized and sent in this case). For this reason, classes referenced from the
code have to implement Serializable (or Externalizable); otherwise we get a
NotSerializableException.
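The point about Java serialization of code can be reproduced without Spark. The sketch below is only an illustration under stated assumptions: `PlainService`, `SerializableService` and `SerializableFunction` are hypothetical names, and plain `ObjectOutputStream` stands in for what the closure serializer does. It shows that a lambda can be serialized only if everything it captures implements Serializable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ClosureSerializationSketch {

    /** Hypothetical service that does NOT implement Serializable. */
    static class PlainService {
        String normalize(String s) { return s.trim().toLowerCase(); }
    }

    /** The same hypothetical service, but Serializable. */
    static class SerializableService implements Serializable {
        private static final long serialVersionUID = 1L;
        String normalize(String s) { return s.trim().toLowerCase(); }
    }

    /** A serializable function type (Spark's Java function interfaces also extend Serializable). */
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable { }

    /** Returns true if Java serialization accepts the object, false on NotSerializableException. */
    static boolean canJavaSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** The lambda captures a non-serializable object, so serializing the lambda fails. */
    static boolean closureOverPlainService() {
        PlainService service = new PlainService();
        SerializableFunction<String, String> f = s -> service.normalize(s);
        return canJavaSerialize(f);
    }

    /** The lambda captures a serializable object, so serializing the lambda succeeds. */
    static boolean closureOverSerializableService() {
        SerializableService service = new SerializableService();
        SerializableFunction<String, String> f = s -> service.normalize(s);
        return canJavaSerialize(f);
    }

    public static void main(String[] args) {
        System.out.println(closureOverPlainService());
        System.out.println(closureOverSerializableService());
    }
}
```

Switching the data serializer to Kryo would not change this outcome, because the captured service travels with the code, not with the data.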
Code serialization or mapPartitions?
It does not make sense to write functions that operate on partitions (like
mapPartitions) and to create service beans inside them only to avoid serializing the
code and sending it between nodes. Serializing and copying the code has little influence
on the efficiency of an application, while using mapPartitions this way complicates the
code and makes unit tests difficult to write. The pattern to avoid:
void execute() {
    rdd.mapPartitions(iter -> {
        SomeService service = new SomeService();
        service.generate…
        ...
        return someCollection;
    });
}
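A possible alternative, sketched here under loud assumptions: the service is a Serializable field of the job (set once, for example by dependency injection) and the per-record logic is an ordinary method. `SomeJob`, `processRecord` and the body of `generate` are hypothetical, and a `java.util.stream` pipeline stands in for the RDD so that the sketch runs without Spark.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class InjectedServiceSketch {

    /** Hypothetical service; Serializable so that it could be shipped inside a closure. */
    static class SomeService implements Serializable {
        private static final long serialVersionUID = 1L;
        String generate(String input) { return input.toUpperCase(); }
    }

    /** Job-like class: the service is injected once instead of being created per partition. */
    static class SomeJob implements Serializable {
        private static final long serialVersionUID = 1L;
        private final SomeService service;

        SomeJob(SomeService service) { this.service = service; }

        /** Per-record logic; with Spark this would be the function passed to rdd.map(...). */
        String processRecord(String record) { return service.generate(record); }

        /** The stream is only a stand-in for an RDD. */
        List<String> execute(List<String> records) {
            return records.stream().map(this::processRecord).collect(Collectors.toList());
        }
    }

    static List<String> run(List<String> records) {
        return new SomeJob(new SomeService()).execute(records);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b")));
    }
}
```

Because `processRecord` is a plain method, it can be unit tested with a mocked or stubbed service and without any cluster or partition iterator.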
Data storage format
To read an Avro file you can use the standard Hadoop API:
JavaPairRDD<AvroKey<T>, NullWritable> inputRecords =
    (JavaPairRDD<AvroKey<T>, NullWritable>) sc.newAPIHadoopFile(avroDatastorePath,
        AvroKeyInputFormat.class, avroRecordClass, NullWritable.class, job.getConfiguration());
Data storage format
Problem: when using the standard Hadoop API in Spark, you can come across
unpredictable errors, because the Hadoop record reader reuses the same
Writable object for all records read.
This is not a problem in MapReduce jobs, where each record is processed separately. In
Spark, however, it can lead to undesired effects. For example, when an RDD is cached, only
the last object read may be cached (as many times as there were records read), because the
cached RDD ends up holding many references to the same reused object.
To avoid this, clone each Avro record right after it has been read.
See: spark-utils/SparkAvroLoader
JavaRDD<DocumentToProject> docProjects = avroLoader.loadJavaRDD(sc, inputPath,
DocumentToProject.class);
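The effect of a reused record object can be imitated in plain Java. This is a simplified sketch, not the SparkAvroLoader code: `Rec` and `reusingReader` are hypothetical stand-ins for an Avro record and a Hadoop record reader, and a list of references stands in for a cached RDD.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RecordReuseSketch {

    /** Hypothetical mutable record, standing in for a reused Avro/Writable object. */
    static class Rec {
        String value;
        Rec(String value) { this.value = value; }
        Rec copy() { return new Rec(value); }
    }

    /** Mimics a record reader that overwrites one shared instance for every input. */
    static Iterator<Rec> reusingReader(List<String> inputs) {
        final Rec shared = new Rec(null);
        final Iterator<String> it = inputs.iterator();
        return new Iterator<Rec>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Rec next() { shared.value = it.next(); return shared; }
        };
    }

    /** Keeps references to whatever the reader returns, as a cache would. */
    static List<String> cacheWithoutCopy(List<String> inputs) {
        List<Rec> cached = new ArrayList<>();
        Iterator<Rec> reader = reusingReader(inputs);
        while (reader.hasNext()) {
            cached.add(reader.next());
        }
        return values(cached);
    }

    /** Copies each record right after it is read, before keeping it. */
    static List<String> cacheWithCopy(List<String> inputs) {
        List<Rec> cached = new ArrayList<>();
        Iterator<Rec> reader = reusingReader(inputs);
        while (reader.hasNext()) {
            cached.add(reader.next().copy());
        }
        return values(cached);
    }

    private static List<String> values(List<Rec> recs) {
        List<String> out = new ArrayList<>();
        for (Rec r : recs) {
            out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("a", "b", "c");
        System.out.println(cacheWithoutCopy(inputs)); // [c, c, c]
        System.out.println(cacheWithCopy(inputs));    // [a, b, c]
    }
}
```

Without the copy, every cached reference points at the same object, so only the last value read is visible.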
Usage of accumulators
At first the IIS execution report was based on accumulators; later it was switched to
plain RDD counts.
Use them wisely (if you have to), and only in actions.
Main disadvantages of accumulators:
• They allow one to store data in custom structures. Naive usage can lead to memory
problems (accumulators on every node, results sent to the driver).
• When used in transformations (map, filter, etc.), repeated tasks (after a node failure
or a memory shortage) can lead to incorrect accumulator values: they are updated as many
times as the given transformation has been executed.
More:
http://imranrashid.com/posts/Spark-Accumulators/
http://stackoverflow.com/questions/29494452/when-are-accumulators-truly-reliable
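The over-counting caused by repeated tasks can be pictured with a rough, Spark-free simulation. Everything here is an assumption made for illustration: `Counter` is a hypothetical stand-in for an accumulator, `mapTask` for a transformation with a side effect, and the loop for a task that is executed again. It does not model how Spark decides which updates to merge.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

public class AccumulatorRetrySketch {

    /** Hypothetical stand-in for an accumulator: a counter updated as a side effect. */
    static final class Counter {
        private final AtomicLong value = new AtomicLong();
        void add(long n) { value.addAndGet(n); }
        long value() { return value.get(); }
    }

    /** A "transformation" with a side effect: counts every record it maps. */
    static List<Integer> mapTask(List<Integer> partition, Counter counter) {
        return partition.stream()
                .map(x -> { counter.add(1); return x * 2; })
                .collect(Collectors.toList());
    }

    /** Executes the same task several times, as if earlier results had been lost. */
    static long countAfterAttempts(List<Integer> partition, int attempts) {
        Counter counter = new Counter();
        for (int i = 0; i < attempts; i++) {
            mapTask(partition, counter);
        }
        return counter.value();
    }

    public static void main(String[] args) {
        List<Integer> partition = Arrays.asList(1, 2, 3);
        System.out.println(countAfterAttempts(partition, 1)); // 3
        System.out.println(countAfterAttempts(partition, 2)); // 6
    }
}
```

The mapped data is the same after either run, but the counter reports twice as many records once the task has been executed twice.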
Piping to external programs
To use an external script in a Spark job, one must upload it to every node:
SparkContext.addFile(path)
To refer to it, one should use (as advised in the documentation comment of addFile):
SparkFiles.get(fileName)
The catch: it only works in local mode! In cluster mode the path is different: the script
files are in the working directory of each node.
Experience: many non-repeatable errors (everything was fine when a node was on the same
server as the driver).
For the solution see: DocumentClassificationJob
Unit tests
• Write unit tests as for any other Java code. It is not difficult if the code is written
properly (just as in the case of non-distributed computing).
• Mocking JavaRDD is not a problem.
• Testing functions (lambda expressions) is a bit tedious.
See:
AffMatchingServiceTest
Testing Spark jobs
Helpful classes that facilitate the testing of Spark jobs:
spark-utils/test
Spark as an action in an Oozie workflow
Only one jar can be passed to spark-submit in an Oozie action.
Use the Maven Shade plugin or a similar tool to merge many jars into one.
Integration tests of Oozie workflows
While working on IIS, one can run the Oozie workflow integration tests from an IDE
(Eclipse, NetBeans).
Dedicated code creates Oozie packages, sends them to a server, polls for the job status
and compares the results with the expected ones.
See: AbstractOozieWorkflowTestCase.java
Conclusions
• It is easier to write and test a Spark job than an equivalent chain of MapReduce jobs.
• Efficiency: after being rewritten from MR to Spark, the CitationMatching module's
execution time fell from 28h to 10h (the comparison is far from perfect because each
version was run on a different cluster).
• Debugging is difficult.
• Easy integration with Oozie. Oozie workflows are less complex, as a lot of logic has
been moved to Spark.

More Related Content

What's hot

Jan Stepien - Introducing structure in Clojure - Codemotion Milan 2017
Jan Stepien - Introducing structure in Clojure - Codemotion Milan 2017Jan Stepien - Introducing structure in Clojure - Codemotion Milan 2017
Jan Stepien - Introducing structure in Clojure - Codemotion Milan 2017
Codemotion
 
Smart Migration to JDK 8
Smart Migration to JDK 8Smart Migration to JDK 8
Smart Migration to JDK 8
Geertjan Wielenga
 
Build, logging, and unit test tools
Build, logging, and unit test toolsBuild, logging, and unit test tools
Build, logging, and unit test tools
Allan Huang
 
Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介
Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介
Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介
scalaconfjp
 
Javantura v3 - ES6 – Future Is Now – Nenad Pečanac
Javantura v3 - ES6 – Future Is Now – Nenad PečanacJavantura v3 - ES6 – Future Is Now – Nenad Pečanac
Javantura v3 - ES6 – Future Is Now – Nenad Pečanac
HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association
 
Gatling
Gatling Gatling
Gatling
Gaurav Shukla
 
Testing in Scala. Adform Research
Testing in Scala. Adform ResearchTesting in Scala. Adform Research
Testing in Scala. Adform Research
Vasil Remeniuk
 
Solid And Sustainable Development in Scala
Solid And Sustainable Development in ScalaSolid And Sustainable Development in Scala
Solid And Sustainable Development in Scala
Kazuhiro Sera
 
Java byte code & virtual machine
Java byte code & virtual machineJava byte code & virtual machine
Java byte code & virtual machine
Laxman Puri
 
JavaCro'14 - Is there Kotlin after Java 8 – Ivan Turčinović and Igor Buzatović
JavaCro'14 - Is there Kotlin after Java 8 – Ivan Turčinović and Igor BuzatovićJavaCro'14 - Is there Kotlin after Java 8 – Ivan Turčinović and Igor Buzatović
JavaCro'14 - Is there Kotlin after Java 8 – Ivan Turčinović and Igor Buzatović
HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association
 
camel-scala.pdf
camel-scala.pdfcamel-scala.pdf
camel-scala.pdf
Hiroshi Ono
 
JVM++: The Graal VM
JVM++: The Graal VMJVM++: The Graal VM
JVM++: The Graal VM
Martin Toshev
 
What is-java
What is-javaWhat is-java
What is-java
Shahid Rasheed
 
Advanced Production Debugging
Advanced Production DebuggingAdvanced Production Debugging
Advanced Production Debugging
Takipi
 
ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data Sources
Jayesh Thakrar
 
Java 8 parallel stream
Java 8 parallel streamJava 8 parallel stream
Java 8 parallel stream
Yung Chieh Tsai
 
Graal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution PlatformGraal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution Platform
Thomas Wuerthinger
 
Graal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian WimmerGraal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian Wimmer
Thomas Wuerthinger
 
Java compilation
Java compilationJava compilation
Java compilation
Mike Kucera
 
JAVA 8 Parallel Stream
JAVA 8 Parallel StreamJAVA 8 Parallel Stream
JAVA 8 Parallel Stream
Tengwen Wang
 

What's hot (20)

Jan Stepien - Introducing structure in Clojure - Codemotion Milan 2017
Jan Stepien - Introducing structure in Clojure - Codemotion Milan 2017Jan Stepien - Introducing structure in Clojure - Codemotion Milan 2017
Jan Stepien - Introducing structure in Clojure - Codemotion Milan 2017
 
Smart Migration to JDK 8
Smart Migration to JDK 8Smart Migration to JDK 8
Smart Migration to JDK 8
 
Build, logging, and unit test tools
Build, logging, and unit test toolsBuild, logging, and unit test tools
Build, logging, and unit test tools
 
Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介
Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介
Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介
 
Javantura v3 - ES6 – Future Is Now – Nenad Pečanac
Javantura v3 - ES6 – Future Is Now – Nenad PečanacJavantura v3 - ES6 – Future Is Now – Nenad Pečanac
Javantura v3 - ES6 – Future Is Now – Nenad Pečanac
 
Gatling
Gatling Gatling
Gatling
 
Testing in Scala. Adform Research
Testing in Scala. Adform ResearchTesting in Scala. Adform Research
Testing in Scala. Adform Research
 
Solid And Sustainable Development in Scala
Solid And Sustainable Development in ScalaSolid And Sustainable Development in Scala
Solid And Sustainable Development in Scala
 
Java byte code & virtual machine
Java byte code & virtual machineJava byte code & virtual machine
Java byte code & virtual machine
 
JavaCro'14 - Is there Kotlin after Java 8 – Ivan Turčinović and Igor Buzatović
JavaCro'14 - Is there Kotlin after Java 8 – Ivan Turčinović and Igor BuzatovićJavaCro'14 - Is there Kotlin after Java 8 – Ivan Turčinović and Igor Buzatović
JavaCro'14 - Is there Kotlin after Java 8 – Ivan Turčinović and Igor Buzatović
 
camel-scala.pdf
camel-scala.pdfcamel-scala.pdf
camel-scala.pdf
 
JVM++: The Graal VM
JVM++: The Graal VMJVM++: The Graal VM
JVM++: The Graal VM
 
What is-java
What is-javaWhat is-java
What is-java
 
Advanced Production Debugging
Advanced Production DebuggingAdvanced Production Debugging
Advanced Production Debugging
 
ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data Sources
 
Java 8 parallel stream
Java 8 parallel streamJava 8 parallel stream
Java 8 parallel stream
 
Graal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution PlatformGraal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution Platform
 
Graal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian WimmerGraal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian Wimmer
 
Java compilation
Java compilationJava compilation
Java compilation
 
JAVA 8 Parallel Stream
JAVA 8 Parallel StreamJAVA 8 Parallel Stream
JAVA 8 Parallel Stream
 

Viewers also liked

Code Review and other aspects of project organization
Code Review and other aspects of project organizationCode Review and other aspects of project organization
Code Review and other aspects of project organization
Łukasz Dumiszewski
 
df: Dataframe on Spark
df: Dataframe on Sparkdf: Dataframe on Spark
df: Dataframe on Spark
Alpine Data
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Spark Scala project
Spark Scala project Spark Scala project
Spark Scala project
Utkarsh Jadhav
 
Elaboration on world war 2
Elaboration on world war 2Elaboration on world war 2
Elaboration on world war 2
gilani syeda
 
Q2 teenagers
Q2 teenagersQ2 teenagers
Q2 teenagers
Brandon Hill
 
Types by Adform Research, Saulius Valatka
Types by Adform Research, Saulius ValatkaTypes by Adform Research, Saulius Valatka
Types by Adform Research, Saulius Valatka
Vasil Remeniuk
 
Variance in scala
Variance in scalaVariance in scala
Variance in scala
LyleK
 
Python in real world.
Python in real world.Python in real world.
Python in real world.
Alph@.M
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
datamantra
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
Spark Summit
 
Python programming - Everyday(ish) Examples
Python programming - Everyday(ish) ExamplesPython programming - Everyday(ish) Examples
Python programming - Everyday(ish) Examples
Ashish Sharma
 
Learning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark ProgrammingLearning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark Programming
phanleson
 
Neo, Titan & Cassandra
Neo, Titan & CassandraNeo, Titan & Cassandra
Neo, Titan & Cassandra
johnrjenson
 
Lets learn Python !
Lets learn Python !Lets learn Python !
Lets learn Python !
Kiran Gangadharan
 
Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learning
datamantra
 
Titan: Scaling Graphs and TinkerPop3
Titan: Scaling Graphs and TinkerPop3Titan: Scaling Graphs and TinkerPop3
Titan: Scaling Graphs and TinkerPop3
Matthias Broecheler
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Databricks
 

Viewers also liked (20)

Code Review and other aspects of project organization
Code Review and other aspects of project organizationCode Review and other aspects of project organization
Code Review and other aspects of project organization
 
df: Dataframe on Spark
df: Dataframe on Sparkdf: Dataframe on Spark
df: Dataframe on Spark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Spark Scala project
Spark Scala project Spark Scala project
Spark Scala project
 
Elaboration on world war 2
Elaboration on world war 2Elaboration on world war 2
Elaboration on world war 2
 
Q2 teenagers
Q2 teenagersQ2 teenagers
Q2 teenagers
 
Types by Adform Research, Saulius Valatka
Types by Adform Research, Saulius ValatkaTypes by Adform Research, Saulius Valatka
Types by Adform Research, Saulius Valatka
 
Variance in scala
Variance in scalaVariance in scala
Variance in scala
 
Python in real world.
Python in real world.Python in real world.
Python in real world.
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
 
Python programming - Everyday(ish) Examples
Python programming - Everyday(ish) ExamplesPython programming - Everyday(ish) Examples
Python programming - Everyday(ish) Examples
 
Learning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark ProgrammingLearning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark Programming
 
Neo, Titan & Cassandra
Neo, Titan & CassandraNeo, Titan & Cassandra
Neo, Titan & Cassandra
 
Lets learn Python !
Lets learn Python !Lets learn Python !
Lets learn Python !
 
Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learning
 
Titan: Scaling Graphs and TinkerPop3
Titan: Scaling Graphs and TinkerPop3Titan: Scaling Graphs and TinkerPop3
Titan: Scaling Graphs and TinkerPop3
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
 

Similar to Programming in Spark - Lessons Learned in OpenAire project

Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
Demi Ben-Ari
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemts
siddharth30121
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
Majid Hajibaba
 
Java 8 Overview
Java 8 OverviewJava 8 Overview
Java 8 Overview
Nicola Pedot
 
Scala Italy 2015 - Hands On ScalaJS
Scala Italy 2015 - Hands On ScalaJSScala Italy 2015 - Hands On ScalaJS
Scala Italy 2015 - Hands On ScalaJS
Alberto Paro
 
Alberto Paro - Hands on Scala.js
Alberto Paro - Hands on Scala.jsAlberto Paro - Hands on Scala.js
Alberto Paro - Hands on Scala.js
Scala Italy
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
J On The Beach
 
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Codemotion
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
Ike Ellis
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
Tim Ellison
 
Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017
Ayush Sharma
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
Viridians on Rails
Viridians on RailsViridians on Rails
Viridians on Rails
Viridians
 
Rapid, Scalable Web Development with MongoDB, Ming, and Python
Rapid, Scalable Web Development with MongoDB, Ming, and PythonRapid, Scalable Web Development with MongoDB, Ming, and Python
Rapid, Scalable Web Development with MongoDB, Ming, and Python
Rick Copeland
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
Mohit Jain
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Hubert Fan Chiang
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
DataFactZ
 

Similar to Programming in Spark - Lessons Learned in OpenAire project (20)

Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemts
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
 
Java 8 Overview
Java 8 OverviewJava 8 Overview
Java 8 Overview
 
Scala Italy 2015 - Hands On ScalaJS
Scala Italy 2015 - Hands On ScalaJSScala Italy 2015 - Hands On ScalaJS
Scala Italy 2015 - Hands On ScalaJS
 
Alberto Paro - Hands on Scala.js
Alberto Paro - Hands on Scala.jsAlberto Paro - Hands on Scala.js
Alberto Paro - Hands on Scala.js
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
Alberto Maria Angelo Paro - Isomorphic programming in Scala and WebDevelopmen...
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark core
Spark coreSpark core
Spark core
 
Viridians on Rails
Viridians on RailsViridians on Rails
Viridians on Rails
 
Rapid, Scalable Web Development with MongoDB, Ming, and Python
Rapid, Scalable Web Development with MongoDB, Ming, and PythonRapid, Scalable Web Development with MongoDB, Ming, and Python
Rapid, Scalable Web Development with MongoDB, Ming, and Python
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 

Recently uploaded

Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
vasanthatpuram
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 

Recently uploaded (20)

Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
How To Control IO Usage using Resource Manager

Programming in Spark - Lessons Learned in OpenAire project

  • 1. Programming in Spark: Lessons Learned in OpenAire project. Łukasz Dumiszewski, ICM, University of Warsaw, 10.2016. Duration: 1h. Requirements: knowledge of Apache Spark.
  • 2. Goals of work in OpenAire (IIS)
      • Rewriting OpenAire (IIS) from MapReduce/Pig to Spark (several modules, e.g. citation-matching)
      • Improving the project structure
      • Enhancing the integration tests
      • Creating new modules: matching of publication affiliations, IIS execution report etc.
  • 3. Problems and solutions
      • Programming language
      • Coding standards
      • Data serialization and cache
      • Code serialization
      • Data storage format
      • Accumulators
      • Piping to external programs
      • Testing
  • 4. Programming language. Java 8: no problems encountered, friendly Java Spark API, readable code.
  • 5. Coding standards. Standard programming practices apply (low coupling, high cohesion). Spring can be used for dependency injection. Pros: code readability and reliability, easy development and testing. See: AffMatchingService, AffMatchingJob
  • 6. Data serialization and cache. Kryo serialization is fast and efficient:
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    Problem: deserialization of Avro collections fails (Avro uses its own specific list implementation). Solved by implementing a custom Kryo registrator:
        conf.set("spark.kryo.registrator", "eu.dnetlib.iis.wf.affmatching.AffMatchingJob$AvroCompatibleKryoRegistrator");
    The registrator can be found at: https://github.com/CeON/spark-utils/
  • 7. Code serialization. A big misunderstanding in books and on the Internet (?). "Learning Spark", in the paragraph describing the change of spark.serializer to Kryo, says: "Whether using Kryo or Java's serializer, you may encounter a NotSerializableException if your code refers to a class that does not extend Java's Serializable interface …" However, the setting spark.serializer does not refer to code serialization but to data serialization. It is spark.closure.serializer that corresponds to code serialization, and it uses Java serialization by default (changing it is not recommended, because only a small amount of data is serialized and sent in this case). For this reason, classes referenced by the code have to implement Serializable (or Externalizable), whichever data serializer is configured. Otherwise we get a NotSerializableException.
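The point of slide 7 can be reproduced without Spark. The following is a minimal plain-Java sketch, assuming that the closure is shipped with standard Java serialization, as the slide describes. All class and method names here (PlainNormalizer, SerializableNormalizer, SerializableFunction, canSerialize) are hypothetical and are not taken from IIS or from Spark.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

class ClosureSerializationSketch {

    // Hypothetical service that does NOT implement Serializable.
    static class PlainNormalizer {
        String normalize(String s) { return s.trim().toLowerCase(); }
    }

    // The same hypothetical service, now marked Serializable.
    static class SerializableNormalizer implements Serializable {
        private static final long serialVersionUID = 1L;
        String normalize(String s) { return s.trim().toLowerCase(); }
    }

    // A serializable function type, so that lambdas assigned to it can be Java-serialized.
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable { }

    // Java-serializes the closure; returns false if it (or something it captures) is not serializable.
    static boolean canSerialize(Object closure) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(closure);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PlainNormalizer plain = new PlainNormalizer();
        SerializableNormalizer serializable = new SerializableNormalizer();

        // Both lambdas capture a service object; only the second service is Serializable.
        SerializableFunction<String, String> capturesPlain = s -> plain.normalize(s);
        SerializableFunction<String, String> capturesSerializable = s -> serializable.normalize(s);

        System.out.println("captures PlainNormalizer: " + canSerialize(capturesPlain));
        System.out.println("captures SerializableNormalizer: " + canSerialize(capturesSerializable));
    }
}
```

The first check fails because of the captured PlainNormalizer instance, not because of the lambda itself, which mirrors the situation where a function passed to an RDD operation refers to a non-serializable service.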
  • 8. Code serialization or mapPartitions? It does not make sense to write functions that operate on whole partitions (such as mapPartitions) and to create service beans inside them only to avoid serializing the code and sending it between nodes. Serializing and copying the code has little influence on the efficiency of an application, whereas using mapPartitions complicates the code and makes unit tests difficult to write. An example of the pattern to avoid:
        void execute() {
            rdd.mapPartitions(iter -> {
                SomeService service = new SomeService();
                service.generate…
                ...
                return someCollection;
            });
        }
  • 9. Data storage format. To read an Avro file you can use the standard Hadoop API:
        JavaPairRDD<AvroKey<T>, NullWritable> inputRecords = (JavaPairRDD<AvroKey<T>, NullWritable>)
            sc.newAPIHadoopFile(avroDatastorePath, AvroKeyInputFormat.class, avroRecordClass, NullWritable.class, job.getConfiguration());
  • 10. Data storage format. Problem: when using the standard Hadoop API in Spark, you can come across unpredictable errors, because the Hadoop record reader reuses the same Writable object for all records read. This is not a problem in MapReduce jobs, where each record is processed separately. In Spark, however, it can lead to undesired effects. For example, when an RDD is cached, only the last object read will be cached (as many times as there are records). The likely cause is that many references to the same object are created. To eliminate this, clone each Avro record right after it has been read. See: spark-utils/SparkAvroLoader
        JavaRDD<DocumentToProject> docProjects = avroLoader.loadJavaRDD(sc, inputPath, DocumentToProject.class);
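The object-reuse effect from slide 10 can be illustrated with a small plain-Java sketch. It does not use Hadoop, Avro or Spark: the Rec class and the reusingReader and cacheIds methods are hypothetical stand-ins for a reused record object, a record reader and a cache that keeps references.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RecordReuseSketch {

    // Hypothetical mutable record, standing in for a record object reused by a reader.
    static class Rec {
        String id;

        Rec copy() {
            Rec r = new Rec();
            r.id = id;
            return r;
        }
    }

    // Mimics a record reader that refills ONE shared instance for every record it reads.
    static Iterator<Rec> reusingReader(List<String> ids) {
        Rec shared = new Rec();
        Iterator<String> it = ids.iterator();
        return new Iterator<Rec>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Rec next() { shared.id = it.next(); return shared; }
        };
    }

    // "Caches" the records by keeping references, optionally copying each record first.
    static List<String> cacheIds(List<String> ids, boolean copyEachRecord) {
        List<Rec> cached = new ArrayList<>();
        Iterator<Rec> reader = reusingReader(ids);
        while (reader.hasNext()) {
            Rec r = reader.next();
            cached.add(copyEachRecord ? r.copy() : r);
        }
        List<String> result = new ArrayList<>();
        for (Rec r : cached) {
            result.add(r.id);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c");
        System.out.println(cacheIds(ids, false)); // prints [c, c, c]
        System.out.println(cacheIds(ids, true));  // prints [a, b, c]
    }
}
```

Without the copy, every cached reference points at the single shared instance, so only the last record survives. Copying each record as it is read is the same idea as the cloning that the slide recommends.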
  • 11. Usage of accumulators. At first the IIS execution report was based on accumulators; later it used plain RDD counts instead. Use accumulators wisely (if you have to), and only in actions. Main disadvantages of accumulators:
      • They allow one to store data in custom structures. Naive usage can lead to memory problems (accumulators live on every node and are sent to the driver).
      • When they are used in transformations (map, filter etc.), repeated tasks (after a node failure or a memory shortage) can lead to incorrect accumulator values: the values are calculated and added as many times as the given transformation has been executed.
    More: http://imranrashid.com/posts/Spark-Accumulators/ http://stackoverflow.com/questions/29494452/when-are-accumulators-truly-reliable
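The over-counting described on slide 11 can be sketched as a plain-Java simulation. This is not Spark's accumulator API: the AtomicLong merely plays the role of an accumulator updated inside a transformation, and mapTask and countedRecords are hypothetical names. The assumption is that a re-executed task applies its side effects again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class AccumulatorRetrySketch {

    // A "map" task over one partition that bumps the counter as a side effect.
    static List<Integer> mapTask(List<Integer> partition, AtomicLong counter) {
        List<Integer> out = new ArrayList<>();
        for (int x : partition) {
            counter.incrementAndGet(); // side effect inside the transformation
            out.add(x * 2);
        }
        return out;
    }

    // Runs the same task `attempts` times, as if earlier attempts had been lost and recomputed,
    // and returns the final counter value.
    static long countedRecords(List<Integer> partition, int attempts) {
        AtomicLong counter = new AtomicLong();
        for (int i = 0; i < attempts; i++) {
            mapTask(partition, counter);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        List<Integer> partition = Arrays.asList(1, 2, 3);
        System.out.println("one attempt:  " + countedRecords(partition, 1)); // prints 3
        System.out.println("two attempts: " + countedRecords(partition, 2)); // prints 6
    }
}
```

The partition holds three records in both runs, yet the counter doubles once the task is repeated. Counting the resulting data set directly is not affected by how many times a task ran, which is consistent with the move to RDD counts mentioned on the slide.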
  • 12. Piping to external programs. To use an external script in a Spark job, one must upload it to every node: SparkContext.addFile(path). To refer to it, one should use SparkFiles.get(fileName), as advised in the comment to addFile. The catch is that this only works in local mode! In cluster mode the path is different: the script files are in the working directory of each node. Experience: many non-repeatable errors (everything was fine when a node was on the same server as the driver). For the solution see: DocumentClassificationJob
  • 13. Unit tests
      • Write unit tests as for any other Java code. It is not difficult if the code is written properly (just as in the case of non-distributed computing).
      • Mocking JavaRDD is not a problem.
      • Testing functions (lambda expressions) is a bit tedious.
    See: AffMatchingServiceTest
  • 14. Testing Spark jobs. Helpful classes that facilitate the testing of Spark jobs: spark-utils/test
  • 15. Spark as an action in an Oozie workflow. Only one jar can be passed to spark-submit in an Oozie action. Use the Maven Shade plugin or a similar tool to merge many jars into one.
  • 16. Integration tests of Oozie workflows. While working on IIS, one can run the Oozie workflow integration tests from an IDE (Eclipse, NetBeans). Dedicated code creates the Oozie packages, sends them to a server, polls for the job status and compares the results with the expected ones. See: AbstractOozieWorkflowTestCase.java
  • 17. Conclusions
      • It is easier to write and to test a Spark job than an equivalent chain of MapReduce jobs.
      • Efficiency: after the rewrite from MapReduce to Spark, the CitationMatching module execution time fell from 28h to 10h (the comparison is far from perfect because each version was run on a different cluster).
      • Debugging is difficult.
      • Easy integration with Oozie. Oozie workflows are less complex, because a lot of logic has been moved to Spark.