SlideShare a Scribd company logo
Unit Testing of Spark ApplicationsUnit Testing of Spark Applications
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
AgendaAgenda
● What is Spark ?
● What is Unit Testing ?
● Why we need Unit Testing ?
● Unit Testing of Spark Applications
● Demo
● What is Spark ?
● What is Unit Testing ?
● Why we need Unit Testing ?
● Unit Testing of Spark Applications
● Demo
What is Spark ?What is Spark ?
● Distributed compute engine for
large-scale data processing.
● 100x faster than Hadoop MapReduce.
● Provides APIs in Python, Scala, Java
and R (Spark 1.4)
● Combines SQL, streaming and
complex analytics.
● Runs on Hadoop, Mesos, or
in the cloud.
● Distributed compute engine for
large-scale data processing.
● 100x faster than Hadoop MapReduce.
● Provides APIs in Python, Scala, Java
and R (Spark 1.4)
● Combines SQL, streaming and
complex analytics.
● Runs on Hadoop, Mesos, or
in the cloud.
src: http://spark.apache.org/src: http://spark.apache.org/
What is Unit Testing ?What is Unit Testing ?
● Unit Testing is a Software Testing method by which individual units
of source code are tested to determine whether they are fit for use or
not.
● They ensure that code meets its design specifications and behaves as
intended.
● Its goal is to isolate each part of the program and show that the
individual parts are correct.
● Unit Testing is a Software Testing method by which individual units
of source code are tested to determine whether they are fit for use or
not.
● They ensure that code meets its design specifications and behaves as
intended.
● Its goal is to isolate each part of the program and show that the
individual parts are correct.
src: https://en.wikipedia.org/wiki/Unit_testingsrc: https://en.wikipedia.org/wiki/Unit_testing
Why we need Unit Testing ?Why we need Unit Testing ?
● Find problems early
- Finds bugs or missing parts of the specification early in the development cycle.
● Facilitates change
- Helps in refactoring and upgradation without worrying about breaking functionality.
● Simplifies integration
- Makes Integration Tests easier to write.
● Documentation
- Provides a living documentation of the system.
● Design
- Can act as formal design of project.
● Find problems early
- Finds bugs or missing parts of the specification early in the development cycle.
● Facilitates change
- Helps in refactoring and upgradation without worrying about breaking functionality.
● Simplifies integration
- Makes Integration Tests easier to write.
● Documentation
- Provides a living documentation of the system.
● Design
- Can act as formal design of project.
src: https://en.wikipedia.org/wiki/Unit_testingsrc: https://en.wikipedia.org/wiki/Unit_testing
Unit Testing of Spark ApplicationsUnit Testing of Spark Applications
Unit to TestUnit to Test
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class WordCount {
def get(url: String, sc: SparkContext): RDD[(String, Int)] = {
val lines = sc.textFile(url)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
}
}
Method 1Method 1
import org.scalatest.{ BeforeAndAfterAll, FunSuite }
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
class WordCountTest extends FunSuite with BeforeAndAfterAll {
private var sparkConf: SparkConf = _
private var sc: SparkContext = _
override def beforeAll() {
sparkConf = new SparkConf().setAppName("unit-testing").setMaster("local")
sc = new SparkContext(sparkConf)
}
private val wordCount = new WordCount
test("get word count rdd") {
val result = wordCount.get("file.txt", sc)
assert(result.take(10).length === 10)
}
override def afterAll() {
sc.stop()
}
}
Cons of Method 1Cons of Method 1
● Explicit management of SparkContext creation and
destruction.
● Developer has to write more lines of code for testing.
● Code duplication as Before and After step has to be repeated
in all Test Suites.
● Explicit management of SparkContext creation and
destruction.
● Developer has to write more lines of code for testing.
● Code duplication as Before and After step has to be repeated
in all Test Suites.
Method 2 (Better Way)Method 2 (Better Way)
"com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2"
Spark Testing Base
A spark package containing base classes to use when writing
tests with Spark.
Spark Testing Base
A spark package containing base classes to use when writing
tests with Spark.
How ?How ?
Method 2 (Better Way) contd...Method 2 (Better Way) contd...
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext
class WordCountTest extends FunSuite with SharedSparkContext {
private val wordCount = new WordCount
test("get word count rdd") {
val result = wordCount.get("file.txt", sc)
assert(result.take(10).length === 10)
}
}
Example 1Example 1
Method 2 (Better Way) contd...Method 2 (Better Way) contd...
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext
import com.holdenkarau.spark.testing.RDDComparisons
class WordCountTest extends FunSuite with SharedSparkContext {
private val wordCount = new WordCount
test("get word count rdd with comparison") {
val expected =
sc.textFile("file.txt")
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
val result = wordCount.get("file.txt", sc)
assert(RDDComparisons.compare(expected, result).isEmpty)
}
}
Example 2Example 2
Pros of Method 2Pros of Method 2
● Succinct code.
● Rich Test API.
● Supports Scala, Java and Python.
● Provides API for testing Streaming applications too.
● Has in-built RDD comparators.
● Supports both Local & Cluster mode testing.
● Succinct code.
● Rich Test API.
● Supports Scala, Java and Python.
● Provides API for testing Streaming applications too.
● Has in-built RDD comparators.
● Supports both Local & Cluster mode testing.
When to use What ?When to use What ?
Method 1
● For Small Scale Spark
applications.
● No requirement of extended
capabilities of spark-testing-base.
● For Sample applications.
Method 1
● For Small Scale Spark
applications.
● No requirement of extended
capabilities of spark-testing-base.
● For Sample applications.
Method 2
● For Large Scale Spark
applications.
● Requirement of Cluster mode or
Performance testing.
● For Production applications.
Method 2
● For Large Scale Spark
applications.
● Requirement of Cluster mode or
Performance testing.
● For Production applications.
DemoDemo
Questions & Option[A]Questions & Option[A]
ReferencesReferences
● https://github.com/holdenk/spark-testing-base
● Effective testing for spark programs Strata NY 2015
● Testing Spark: Best Practices
● https://github.com/holdenk/spark-testing-base
● Effective testing for spark programs Strata NY 2015
● Testing Spark: Best Practices
Thank youThank you

More Related Content

What's hot

How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
ScyllaDB
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
Databricks
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
Spark Summit
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Julian Hyde
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Databricks
 
Karate - Web-Service API Testing Made Simple
Karate - Web-Service API Testing Made SimpleKarate - Web-Service API Testing Made Simple
Karate - Web-Service API Testing Made Simple
VodqaBLR
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
DataWorks Summit/Hadoop Summit
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
Jordan Halterman
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
Test Data Management: The Underestimated Pain
Test Data Management: The Underestimated PainTest Data Management: The Underestimated Pain
Test Data Management: The Underestimated Pain
Chelsea Frischknecht
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
ScyllaDB
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 

What's hot (20)

How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
Using AI to Build a Self-Driving Query Optimizer with Shivnath Babu and Adria...
 
Karate - Web-Service API Testing Made Simple
Karate - Web-Service API Testing Made SimpleKarate - Web-Service API Testing Made Simple
Karate - Web-Service API Testing Made Simple
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Test Data Management: The Underestimated Pain
Test Data Management: The Underestimated PainTest Data Management: The Underestimated Pain
Test Data Management: The Underestimated Pain
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 

Similar to Unit testing of spark applications

Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
Pavel Hardak
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Eren Avşaroğulları
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
Databricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
OWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA TestersOWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA Testers
Javan Rasokat
 
Kirill Rozin - Practical Wars for Automatization
Kirill Rozin - Practical Wars for AutomatizationKirill Rozin - Practical Wars for Automatization
Kirill Rozin - Practical Wars for Automatization
Sergey Arkhipov
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Michael Rys
 
The_Little_Jenkinsfile_That_Could
The_Little_Jenkinsfile_That_CouldThe_Little_Jenkinsfile_That_Could
The_Little_Jenkinsfile_That_Could
Shelley Lambert
 
OpenAPI and gRPC Side by-Side
OpenAPI and gRPC Side by-SideOpenAPI and gRPC Side by-Side
OpenAPI and gRPC Side by-Side
Tim Burks
 
LF_APIStrat17_OpenAPI and gRPC Side-by-Side
LF_APIStrat17_OpenAPI and gRPC Side-by-SideLF_APIStrat17_OpenAPI and gRPC Side-by-Side
LF_APIStrat17_OpenAPI and gRPC Side-by-Side
LF_APIStrat
 
Useful practices of creation automatic tests by using cucumber jvm
Useful practices of creation automatic tests by using cucumber jvmUseful practices of creation automatic tests by using cucumber jvm
Useful practices of creation automatic tests by using cucumber jvm
Anton Shapin
 
Automated Developer Testing: Achievements and Challenges
Automated Developer Testing: Achievements and ChallengesAutomated Developer Testing: Achievements and Challenges
Automated Developer Testing: Achievements and Challenges
Tao Xie
 
Kelly potvin nosurprises_odtug_oow12
Kelly potvin nosurprises_odtug_oow12Kelly potvin nosurprises_odtug_oow12
Kelly potvin nosurprises_odtug_oow12Enkitec
 
Strategy-driven Test Generation with Open Source Frameworks
Strategy-driven Test Generation with Open Source FrameworksStrategy-driven Test Generation with Open Source Frameworks
Strategy-driven Test Generation with Open Source Frameworks
Dimitry Polivaev
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R project
WLOG Solutions
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Spark with HDInsight
Spark with HDInsightSpark with HDInsight
Spark with HDInsight
Khalid Salama
 
Performance and Scalability Testing with Python and Multi-Mechanize
Performance and Scalability Testing with Python and Multi-MechanizePerformance and Scalability Testing with Python and Multi-Mechanize
Performance and Scalability Testing with Python and Multi-Mechanize
coreygoldberg
 
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMPInria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
Stéphanie Roger
 

Similar to Unit testing of spark applications (20)

Resume_Shanthi
Resume_ShanthiResume_Shanthi
Resume_Shanthi
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
OWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA TestersOWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA Testers
 
Kirill Rozin - Practical Wars for Automatization
Kirill Rozin - Practical Wars for AutomatizationKirill Rozin - Practical Wars for Automatization
Kirill Rozin - Practical Wars for Automatization
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
The_Little_Jenkinsfile_That_Could
The_Little_Jenkinsfile_That_CouldThe_Little_Jenkinsfile_That_Could
The_Little_Jenkinsfile_That_Could
 
OpenAPI and gRPC Side by-Side
OpenAPI and gRPC Side by-SideOpenAPI and gRPC Side by-Side
OpenAPI and gRPC Side by-Side
 
LF_APIStrat17_OpenAPI and gRPC Side-by-Side
LF_APIStrat17_OpenAPI and gRPC Side-by-SideLF_APIStrat17_OpenAPI and gRPC Side-by-Side
LF_APIStrat17_OpenAPI and gRPC Side-by-Side
 
Useful practices of creation automatic tests by using cucumber jvm
Useful practices of creation automatic tests by using cucumber jvmUseful practices of creation automatic tests by using cucumber jvm
Useful practices of creation automatic tests by using cucumber jvm
 
Automated Developer Testing: Achievements and Challenges
Automated Developer Testing: Achievements and ChallengesAutomated Developer Testing: Achievements and Challenges
Automated Developer Testing: Achievements and Challenges
 
Kelly potvin nosurprises_odtug_oow12
Kelly potvin nosurprises_odtug_oow12Kelly potvin nosurprises_odtug_oow12
Kelly potvin nosurprises_odtug_oow12
 
Strategy-driven Test Generation with Open Source Frameworks
Strategy-driven Test Generation with Open Source FrameworksStrategy-driven Test Generation with Open Source Frameworks
Strategy-driven Test Generation with Open Source Frameworks
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R project
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Spark with HDInsight
Spark with HDInsightSpark with HDInsight
Spark with HDInsight
 
Performance and Scalability Testing with Python and Multi-Mechanize
Performance and Scalability Testing with Python and Multi-MechanizePerformance and Scalability Testing with Python and Multi-Mechanize
Performance and Scalability Testing with Python and Multi-Mechanize
 
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMPInria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
 

More from Knoldus Inc.

Getting Started with Apache Spark (Scala)
Getting Started with Apache Spark (Scala)Getting Started with Apache Spark (Scala)
Getting Started with Apache Spark (Scala)
Knoldus Inc.
 
Secure practices with dot net services.pptx
Secure practices with dot net services.pptxSecure practices with dot net services.pptx
Secure practices with dot net services.pptx
Knoldus Inc.
 
Distributed Cache with dot microservices
Distributed Cache with dot microservicesDistributed Cache with dot microservices
Distributed Cache with dot microservices
Knoldus Inc.
 
Introduction to gRPC Presentation (Java)
Introduction to gRPC Presentation (Java)Introduction to gRPC Presentation (Java)
Introduction to gRPC Presentation (Java)
Knoldus Inc.
 
Using InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in JmeterUsing InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in Jmeter
Knoldus Inc.
 
Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)
Knoldus Inc.
 
Stakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) PresentationStakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) Presentation
Knoldus Inc.
 
Introduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) PresentationIntroduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) Presentation
Knoldus Inc.
 
Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)
Knoldus Inc.
 
Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)
Knoldus Inc.
 
Clean Code in Test Automation Differentiating Between the Good and the Bad
Clean Code in Test Automation  Differentiating Between the Good and the BadClean Code in Test Automation  Differentiating Between the Good and the Bad
Clean Code in Test Automation Differentiating Between the Good and the Bad
Knoldus Inc.
 
Integrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test AutomationIntegrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test Automation
Knoldus Inc.
 
State Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptxState Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptx
Knoldus Inc.
 
Authentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptxAuthentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptx
Knoldus Inc.
 
OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)
Knoldus Inc.
 
Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptx
Knoldus Inc.
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Knoldus Inc.
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
Knoldus Inc.
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
Knoldus Inc.
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
Knoldus Inc.
 

More from Knoldus Inc. (20)

Getting Started with Apache Spark (Scala)
Getting Started with Apache Spark (Scala)Getting Started with Apache Spark (Scala)
Getting Started with Apache Spark (Scala)
 
Secure practices with dot net services.pptx
Secure practices with dot net services.pptxSecure practices with dot net services.pptx
Secure practices with dot net services.pptx
 
Distributed Cache with dot microservices
Distributed Cache with dot microservicesDistributed Cache with dot microservices
Distributed Cache with dot microservices
 
Introduction to gRPC Presentation (Java)
Introduction to gRPC Presentation (Java)Introduction to gRPC Presentation (Java)
Introduction to gRPC Presentation (Java)
 
Using InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in JmeterUsing InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in Jmeter
 
Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)
 
Stakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) PresentationStakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) Presentation
 
Introduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) PresentationIntroduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) Presentation
 
Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)
 
Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)
 
Clean Code in Test Automation Differentiating Between the Good and the Bad
Clean Code in Test Automation  Differentiating Between the Good and the BadClean Code in Test Automation  Differentiating Between the Good and the Bad
Clean Code in Test Automation Differentiating Between the Good and the Bad
 
Integrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test AutomationIntegrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test Automation
 
State Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptxState Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptx
 
Authentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptxAuthentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptx
 
OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)
 
Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptx
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
 

Recently uploaded

Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 

Recently uploaded (20)

Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 

Unit testing of spark applications

  • 1. Unit Testing of Spark ApplicationsUnit Testing of Spark Applications Himanshu Gupta Sr. Software Consultant Knoldus Software LLP Himanshu Gupta Sr. Software Consultant Knoldus Software LLP
  • 2. AgendaAgenda ● What is Spark ? ● What is Unit Testing ? ● Why we need Unit Testing ? ● Unit Testing of Spark Applications ● Demo ● What is Spark ? ● What is Unit Testing ? ● Why we need Unit Testing ? ● Unit Testing of Spark Applications ● Demo
  • 3. What is Spark ?What is Spark ? ● Distributed compute engine for large-scale data processing. ● 100x faster than Hadoop MapReduce. ● Provides APIs in Python, Scala, Java and R (Spark 1.4) ● Combines SQL, streaming and complex analytics. ● Runs on Hadoop, Mesos, or in the cloud. ● Distributed compute engine for large-scale data processing. ● 100x faster than Hadoop MapReduce. ● Provides APIs in Python, Scala, Java and R (Spark 1.4) ● Combines SQL, streaming and complex analytics. ● Runs on Hadoop, Mesos, or in the cloud. src: http://spark.apache.org/src: http://spark.apache.org/
  • 4. What is Unit Testing ?What is Unit Testing ? ● Unit Testing is a Software Testing method by which individual units of source code are tested to determine whether they are fit for use or not. ● They ensure that code meets its design specifications and behaves as intended. ● Its goal is to isolate each part of the program and show that the individual parts are correct. ● Unit Testing is a Software Testing method by which individual units of source code are tested to determine whether they are fit for use or not. ● They ensure that code meets its design specifications and behaves as intended. ● Its goal is to isolate each part of the program and show that the individual parts are correct. src: https://en.wikipedia.org/wiki/Unit_testingsrc: https://en.wikipedia.org/wiki/Unit_testing
  • 5. Why we need Unit Testing ?Why we need Unit Testing ? ● Find problems early - Finds bugs or missing parts of the specification early in the development cycle. ● Facilitates change - Helps in refactoring and upgradation without worrying about breaking functionality. ● Simplifies integration - Makes Integration Tests easier to write. ● Documentation - Provides a living documentation of the system. ● Design - Can act as formal design of project. ● Find problems early - Finds bugs or missing parts of the specification early in the development cycle. ● Facilitates change - Helps in refactoring and upgradation without worrying about breaking functionality. ● Simplifies integration - Makes Integration Tests easier to write. ● Documentation - Provides a living documentation of the system. ● Design - Can act as formal design of project. src: https://en.wikipedia.org/wiki/Unit_testingsrc: https://en.wikipedia.org/wiki/Unit_testing
  • 6. Unit Testing of Spark ApplicationsUnit Testing of Spark Applications
  • 7. Unit to TestUnit to Test import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD class WordCount { def get(url: String, sc: SparkContext): RDD[(String, Int)] = { val lines = sc.textFile(url) lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) } }
  • 8. Method 1Method 1 import org.scalatest.{ BeforeAndAfterAll, FunSuite } import org.apache.spark.SparkContext import org.apache.spark.SparkConf class WordCountTest extends FunSuite with BeforeAndAfterAll { private var sparkConf: SparkConf = _ private var sc: SparkContext = _ override def beforeAll() { sparkConf = new SparkConf().setAppName("unit-testing").setMaster("local") sc = new SparkContext(sparkConf) } private val wordCount = new WordCount test("get word count rdd") { val result = wordCount.get("file.txt", sc) assert(result.take(10).length === 10) } override def afterAll() { sc.stop() } }
  • 9. Cons of Method 1Cons of Method 1 ● Explicit management of SparkContext creation and destruction. ● Developer has to write more lines of code for testing. ● Code duplication as Before and After step has to be repeated in all Test Suites. ● Explicit management of SparkContext creation and destruction. ● Developer has to write more lines of code for testing. ● Code duplication as Before and After step has to be repeated in all Test Suites.
  • 10. Method 2 (Better Way)Method 2 (Better Way) "com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2" Spark Testing Base A spark package containing base classes to use when writing tests with Spark. Spark Testing Base A spark package containing base classes to use when writing tests with Spark. How ?How ?
  • 11. Method 2 (Better Way) contd...Method 2 (Better Way) contd... import org.scalatest.FunSuite import com.holdenkarau.spark.testing.SharedSparkContext class WordCountTest extends FunSuite with SharedSparkContext { private val wordCount = new WordCount test("get word count rdd") { val result = wordCount.get("file.txt", sc) assert(result.take(10).length === 10) } } Example 1Example 1
  • 12. Method 2 (Better Way) contd...Method 2 (Better Way) contd... import org.scalatest.FunSuite import com.holdenkarau.spark.testing.SharedSparkContext import com.holdenkarau.spark.testing.RDDComparisons class WordCountTest extends FunSuite with SharedSparkContext { private val wordCount = new WordCount test("get word count rdd with comparison") { val expected = sc.textFile("file.txt") .flatMap(_.split(" ")) .map((_, 1)) .reduceByKey(_ + _) val result = wordCount.get("file.txt", sc) assert(RDDComparisons.compare(expected, result).isEmpty) } } Example 2Example 2
  • 13. Pros of Method 2Pros of Method 2 ● Succinct code. ● Rich Test API. ● Supports Scala, Java and Python. ● Provides API for testing Streaming applications too. ● Has in-built RDD comparators. ● Supports both Local & Cluster mode testing. ● Succinct code. ● Rich Test API. ● Supports Scala, Java and Python. ● Provides API for testing Streaming applications too. ● Has in-built RDD comparators. ● Supports both Local & Cluster mode testing.
  • 14. When to use What ?When to use What ? Method 1 ● For Small Scale Spark applications. ● No requirement of extended capabilities of spark-testing-base. ● For Sample applications. Method 1 ● For Small Scale Spark applications. ● No requirement of extended capabilities of spark-testing-base. ● For Sample applications. Method 2 ● For Large Scale Spark applications. ● Requirement of Cluster mode or Performance testing. ● For Production applications. Method 2 ● For Large Scale Spark applications. ● Requirement of Cluster mode or Performance testing. ● For Production applications.
  • 17. ReferencesReferences ● https://github.com/holdenk/spark-testing-base ● Effective testing for spark programs Strata NY 2015 ● Testing Spark: Best Practices ● https://github.com/holdenk/spark-testing-base ● Effective testing for spark programs Strata NY 2015 ● Testing Spark: Best Practices