Apache Spark is a general engine for processing data on a large scale, and employing it in a distributed environment to process large data sets is undeniably beneficial.
But what about a fast feedback loop while developing such an application with Apache Spark? Testing it on a cluster is essential, but that is not what most developers accustomed to a TDD workflow would like to do.
In the talk, Łukasz will share some tips on how to write unit and integration tests, and how Docker can be applied to test a Spark application on a local machine.
Examples will be presented within the ScalaTest framework, and they should be easy to grasp for people who know Scala or other JVM languages.
Optimising Geospatial Queries with Dynamic File Pruning (Databricks)
One of the most significant benefits provided by Databricks Delta is the ability to use z-ordering and dynamic file pruning to significantly reduce the amount of data that is retrieved from blob storage and therefore drastically improve query times, sometimes by an order of magnitude.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (Databricks)
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes promptly to external storage such as Delta or Kudu for real-time OLAP. Implementing a robust CDC streaming pipeline raises many concerns, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and how to build it for a variety of databases with minimal code.
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Operating and Supporting Delta Lake in Production (Databricks)
The document discusses strategies for optimizing and managing metadata in Delta Lake. It provides an overview of optimize, auto-optimize, and optimize write strategies and how to choose the appropriate strategy based on factors like workload, data size, and cluster resources. It also discusses Delta Lake transaction logs, configurations like log retention duration, and tips for working with Delta Lake metadata.
The document compares three table formats for large scale data storage and analytics: ACID ORC, Apache Iceberg, and Delta Lake. ACID ORC provides ACID transactions for Hive tables stored as ORC files but has slow metadata operations. Iceberg supports multiple file formats, robust schema changes, and time travel capabilities but lacks commercial support. Delta Lake has great Spark integration, SQL merge syntax, and time travel with optimized compaction, but only supports Parquet files and multicluster writes on HDFS.
Containerized Stream Engine to Build Modern Delta Lake (Databricks)
Everything is changing by the day: your business, your analytics platform, and your data. Deriving real-time insights from this huge volume of data is key to survival, and a robust solution lets you operate at the speed of change.
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha... (Databricks)
The document discusses code optimization techniques in Spark SQL's Catalyst optimizer. It describes how function outlining can improve performance of generated Java code by splitting large routines into smaller ones. The document outlines a Spark SQL query optimization case study where outlining a 300+ line routine from Catalyst code generation improved query performance by up to 19% on a Power8 cluster. Overall, the document examines how function outlining and other code generation optimizations in Catalyst can help the Java JIT compiler better optimize Spark SQL queries.
Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
Koalas is an open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python. pandas is the standard tool for data science and is typically the first step to explore and manipulate a data set, but it does not scale well to big data.
Deep Dive into the New Features of Apache Spark 3.1 (Databricks)
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as some other major initiatives coming in the future, and we want to share many of the more important changes with the community through examples and demos.
The following features are covered: SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, and the performance enhancements and new tuning tricks in the query compiler.
Enhancements that will make your SQL database roar, SP1 edition, SQL Bits 2017 (Bob Ward)
This document provides information about various SQL Server features and editions. It includes a list of features available in each edition like row-level security, dynamic data masking, and in-memory OLTP. It also includes memory limits, MAXDOP settings, and pushdown capabilities for different editions. The document discusses lightweight query profiling improvements in SQL Server 2016 SP1 and provides details on predicate pushdown indicators in showplans.
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal (Databricks)
Ingesting data from a variety of sources like MySQL, Oracle, Kafka, Salesforce, BigQuery, S3, SaaS applications, OSS, etc., with billions of records into a data lake (for reporting, ad-hoc analytics, and ML jobs) with reliability, consistency, schema evolution support, and within the expected SLA has always been a challenging job. Ingestion also comes in different flavors, such as full ingestion and incremental ingestion with or without compaction/de-duplication and transformations, each with its own complexity of state management and performance. Not to mention dependency management, where hundreds or thousands of downstream jobs depend on this ingested data, so on-time data availability is of utmost importance. Most data teams end up creating ad-hoc ingestion pipelines written in different languages and technologies, which adds operational overhead, and the knowledge is mostly limited to a few people.
In this session, I will talk about how we leveraged Spark's DataFrame abstraction to create a generic ingestion platform capable of ingesting data from varied sources with reliability, consistency, automatic schema evolution, and transformation support. I will also discuss how we developed Spark-based data sanity checks as one of the core components of this platform to ensure 100% correctness of ingested data and auto-recovery when inconsistencies are found. This talk will also cover how Hive table creation and schema modification were part of this platform, providing read-time consistency without locking while Spark ingestion jobs were writing to the same Hive tables, and how we maintained different versions of ingested data to allow rollbacks if required and to let users go back in time and read a snapshot of the ingested data at that moment.
After this talk, one should be able to understand the challenges involved in ingesting data reliably from different sources and how one can leverage Spark's DataFrame abstraction to solve this in a unified way.
Brk3043 Azure SQL DB: intelligent cloud database for app developers - Wash DC (Bob Ward)
Make building and maintaining applications easier and more productive. With built-in intelligence that learns app patterns and adapts to maximize performance, reliability, and data protection, SQL Database is a cloud database built for developers. The session covers our most advanced features to-date including Threat Detection, auto-tuned performance and actionable recommendations across performance and security aspects. Case studies and live demos help you understand how choosing SQL Database will make a difference for your app and your company.
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook (Databricks)
Machine Learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we’ve added several features in Spark core/SQL to add first class support for Feature Injection and Feature Reaping in Spark. Feature Injection is an important prerequisite to (offline) ML training where the base features are injected/aligned with new/experimental features, with the goal to improve model performance over time. From a query engine’s perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table which, if implemented naively, could get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark, wherein IF the data in the base table and the injected feature can be aligned during writes, the join itself can be performed inexpensively.
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar... (Spark Summit)
This document discusses Spark Streaming and how it can push throughput limits in a reactive way. It describes how Spark Streaming works by breaking streams into micro-batches and processing them through Spark. It also discusses how Spark Streaming can be made more reactive by incorporating principles from Reactive Streams, including composable back pressure. The document concludes by discussing challenges like data locality and providing resources for further information.
Composable Data Processing with Apache Spark (Databricks)
As the usage of Apache Spark continues to ramp up within the industry, a major challenge has been scaling our development. Too often we find that developers are re-implementing a similar set of cross-cutting concerns, sprinkled with some variance of use-case specific business logic as a concrete Spark App.
From HDFS to S3: Migrate Pinterest Apache Spark Clusters (Databricks)
The document discusses Pinterest migrating their Apache Spark clusters from HDFS to S3 storage. Some key points:
1) Migrating to S3 provided significantly better performance due to the higher IOPS of modern EC2 instances compared to their older HDFS nodes. Jobs saw 25-35% improvements on average.
2) S3 is eventually consistent while HDFS is strongly consistent, so they implemented the S3Committer to handle output consistency issues during job failures.
3) Metadata operations like file moves were very slow in S3, so they optimized jobs to reduce unnecessary moves using techniques like multipart uploads to S3.
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series (Amazon Web Services)
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and tune query and database performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
Solving low latency query over big data with Spark SQL (Julien Pierre)
This document provides an overview of client data, capabilities, and architecture for a data analytics platform. It discusses data size and query latency, processing and storage using Cosmos, SparkSQL and HDFS, a Mesos cluster architecture with Zookeeper, and interactive analytics using Zeppelin and Avocado notebooks. The platform aims to provide a unified environment for data ingestion, transformation, storage, processing and analytics to enable intelligent data products and experiences.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance and, you’ll hear from a specific customer and their use case to take advantage of fast performance on enormous datasets leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
Mapping Data Flows Perf Tuning April 2021 (Mark Kromer)
This document discusses optimizing performance for data flows in Azure Data Factory. It provides sample timing results for various scenarios and recommends settings to improve performance. Some best practices include using memory optimized Azure integration runtimes, maintaining current partitioning, scaling virtual cores, and optimizing transformations and sources/sinks. The document also covers monitoring flows to identify bottlenecks and global settings that affect performance.
The document discusses the challenges of migrating a production pipeline from a legacy Big Data platform to Spark. It presents an approach using CyFlow, a framework built on Spark that allows component reuse and defines dependencies through a directed acyclic graph (DAG). Key challenges addressed include maintaining semantics during code conversion, meeting real-time constraints, and reducing costs. Metrics for validation include Jaccard similarity and precision/recall. Performance is tuned by aggregating state, modifying partitions, caching data, and unpersisting unneeded dataframes.
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ... (Amazon Web Services)
This document discusses Coursera's use of AWS services like Amazon Redshift, EMR, and Data Pipeline to consolidate their data from various sources, make the data easier for analysts and users to access, and increase the reliability of their data infrastructure. It describes how Coursera programmatically defined ETL pipelines using these services to extract, transform, and load data between sources like MySQL, Cassandra, S3, and Redshift. It also discusses how they built reporting and visualization tools to provide self-service access to the data and ensure high data quality and availability.
RealityMine collects digital user behavior data to help companies with marketing, product development, and analyzing user patterns. They are migrating from an on-premise SQL Server data warehouse to Amazon Redshift to handle doubling data volumes. Redshift provides better performance and scalability at lower cost compared to other options. It requires extracting raw data from SQL Server without encoding issues, loading to S3, and transforming in Redshift using a star schema with careful consideration of distribution and sort keys for query performance. Ongoing database maintenance and backups are also different in Redshift.
This document discusses NoSQL databases and Azure Cosmos DB. It notes that Cosmos DB supports key-value, column, document and graph data models. It guarantees high availability and throughput while offering customizable pricing based on throughput. Cosmos DB uses the Atom-Record-Sequence data model and provides SQL and table APIs to access and query data. The document provides an example of how 12 relational tables could be collapsed into 3 document collections in Cosmos DB.
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster, and WholeStageCodeGen improves performance through Java JIT-compiled code. However, the Java JIT usually does not make good use of the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as an LLVM-based SQL engine, Gandiva. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
Knoldus organized a Meetup on 1 April 2015. In this Meetup, we introduced Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. Spark is used at a wide range of organizations to process large datasets.
This document summarizes Apache Spark batch APIs, provides real-world examples of Spark jobs, addresses shortcomings of the Spark APIs, and outlines how to run and configure Spark jobs on AWS EMR. The document introduces the RDD, SQL, DataFrame and Dataset APIs in Spark and compares them. It then gives examples of enriching and shredding data with Spark. It discusses type-safe APIs to address issues in the default Spark APIs. Finally, it outlines the configuration needed to run optimized Spark jobs on EMR, including memory, parallelism and allocation settings.
This is a quick introduction to Scalding and monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
This document summarizes a presentation about unit testing Spark applications. The presentation discusses why it is important to run Spark locally and as unit tests instead of just on a cluster for faster feedback and easier debugging. It provides examples of how to run Spark locally in an IDE and as ScalaTest unit tests, including how to create test RDDs and DataFrames and supply test data. It also discusses testing concepts for streaming applications, MLlib, GraphX, and integration testing with technologies like HBase and Kafka.
Apache Spark is a fast and general cluster computing system that improves efficiency through in-memory computing and usability through rich APIs. Spark SQL provides a way to work with structured data and transform RDDs using SQL. It can read data from sources like Parquet and JSON files, Hive, and write query results to Parquet for efficient querying. Spark SQL also allows machine learning pipelines to be built by connecting SQL queries to MLlib algorithms.
Spark SQL Deep Dive @ Melbourne Spark Meetup (Databricks)
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
Spark is an open-source cluster computing framework. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark supports batch, streaming, and interactive computations in a unified framework. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across a cluster for parallel processing. RDDs support transformations like map and filter that return new RDDs and actions that return values to the driver program.
Apache Spark, the Next Generation Cluster Computing (Gerger)
This document provides a 3 sentence summary of the key points:
Apache Spark is an open source cluster computing framework that is faster than Hadoop MapReduce by running computations in memory through RDDs, DataFrames and Datasets. It provides high-level APIs for batch, streaming and interactive queries along with libraries for machine learning. Spark's performance is improved through techniques like Catalyst query optimization, Tungsten in-memory columnar formats, and whole stage code generation.
Spark is a fast and general engine for large-scale data processing. It runs on Hadoop clusters through YARN and Mesos, and can also run standalone. Spark is up to 100x faster than Hadoop for certain applications because it keeps data in memory rather than disk, and it supports iterative algorithms through its Resilient Distributed Dataset (RDD) abstraction. The presenter provides a demo of Spark's word count algorithm in Scala, Java, and Python to illustrate how easy it is to use Spark across languages.
Using Spark 1.2 with Java 8 and Cassandra (Denis Dus)
A brief introduction to the Spark data processing ideology and a comparison of Java 7 and Java 8 usage with Spark, with examples of loading and processing data with the Spark Cassandra Loader.
These slides were presented by Hossein Falaki of Databricks to the Atlanta Apache Spark User Group on Thursday, March 9, 2017: https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238120227/
This document provides an overview of Apache Spark, including:
- The problems of big data that Spark addresses like large volumes of data from various sources.
- A comparison of Spark to existing techniques like Hadoop, noting Spark allows for better developer productivity and performance.
- An overview of the Spark ecosystem and how Spark can integrate with an existing enterprise.
- Details about Spark's programming model including its RDD abstraction and use of transformations and actions.
- A discussion of Spark's execution model involving stages and tasks.
Our product uses third generation Big Data technologies and Spark Structured Streaming to enable comprehensive Digital Transformation. It provides a unified streaming API that allows for continuous processing, interactive queries, joins with static data, continuous aggregations, stateful operations, and low latency. The presentation introduces Spark Structured Streaming's basic concepts including loading from stream sources like Kafka, writing to sinks, triggers, SQL integration, and mixing streaming with batch processing. It also covers continuous aggregations with windows, stateful operations with checkpointing, reading from and writing to Kafka, and benchmarks compared to other streaming frameworks.
A Tale of Two APIs: Using Spark Streaming In Production (Lightbend)
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address Fast Data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
This document provides an overview of the Scala programming language. Some key points:
- Scala runs on the Java Virtual Machine and was created by Martin Odersky at EPFL.
- It has been around since 2003 and the current stable release is 2.7.7. Release 2.8 beta 1 is due out soon.
- Scala combines object-oriented and functional programming. It has features like pattern matching, actors, XML literals, and more that differ from Java. Everything in Scala is an object.
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... (Databricks)
The SQL tab in the Spark UI provides a lot of information for analysing your spark queries, ranging from the query plan, to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented, and have trouble using it to their benefit. In this talk we want to give a gentle introduction to how to read this SQL tab. We will first go over all the common spark operations, such as scans, projects, filter, aggregations and joins; and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.
Spark Streaming Programming Techniques You Should Know with Gerard Maas (Spark Summit)
At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can directly operate on RDDs, transform them to Datasets to make use of that abstraction or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, we will show practical uses of the forgotten ‘ConstantInputDStream’ and will explain how to combine Spark Streaming with probabilistic data structures to optimize the use of memory in order to improve the resource usage of long-running streaming jobs. Attendees of this session will come out with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
This document provides an overview of Apache Spark, an open-source cluster computing framework. It discusses Spark's history and community growth. Key aspects covered include Resilient Distributed Datasets (RDDs) which allow transformations like map and filter, fault tolerance through lineage tracking, and caching data in memory or disk. Example applications demonstrated include log mining, machine learning algorithms, and Spark's libraries for SQL, streaming, and machine learning.
2. Overview
• Why run an application outside of a cluster?
• Spark in nutshell
• Unit and integration tests
• Tools
• Spark Streaming integration tests
• Best practices and pitfalls
17. Example – word count
WordCount maps (extracts) words from an input source and reduces
(summarizes) the results, returning a count of each word.
18. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
19. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
20. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
.flatMap((line: String) => line.split(" "))
.map((word: String) => (word, 1))
.reduceByKey((occurence1: Int, occurence2: Int) => {
occurence1 + occurence2
})
21. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
.flatMap((line: String) => line.split(" "))
.map((word: String) => (word, 1))
.reduceByKey((occurence1: Int, occurence2: Int) => {
occurence1 + occurence2
}).saveAsTextFile("/tmp/output")
22. object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[4]")
.setAppName("Quality Excites")
val sc = new SparkContext(conf)
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
.flatMap((line: String) => line.split(" "))
.map((word: String) => (word, 1))
.reduceByKey((occurence1: Int, occurence2: Int) => {
occurence1 + occurence2
}).saveAsTextFile("/tmp/output")
23. object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Quality Excites")
    val sc = new SparkContext(conf)
    val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
    val wordsRDD: RDD[String] = sc.parallelize(words)
    wordsRDD
      .flatMap(WordCount.extractWords)
      .map((word: String) => (word, 1))
      .reduceByKey((occurrence1: Int, occurrence2: Int) => {
        occurrence1 + occurrence2
      }).saveAsTextFile("/tmp/output")
  }
}
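The extracted helper itself never appears on a slide; what follows is a minimal sketch, assuming WordCount is a plain Scala object (name and signature inferred from the call sites):

object WordCount {
  // A pure function: unit-testable without any SparkContext.
  def extractWords(line: String): Array[String] =
    line.split(" ")
}

Because extractWords is just a String => Array[String] function, the unit tests below can exercise it without starting Spark at all.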
25. Example unit test
class S00_UnitTest extends FunSpec with Matchers {
it("should split a sentence into words") {
val line = "Ala ma kota"
val words: Array[String] = WordCount.extractWords(line = line)
val expected = Array("Ala", "ma", "kota")
words should be (expected)
}
}
28. Example unit test
class S00_UnitTest extends BasicScalaTest {
it("should split a sentence into words") {
val line = "Ala ma kota"
val words: Array[String] = WordCount.extractWords(line = line)
val expected = Array("Ala", "ma", "kota")
words should be (expected)
}
}
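BasicScalaTest is used but never defined in the deck; presumably it is a small base class bundling the ScalaTest traits every unit test mixes in. A sketch under that assumption:

import org.scalatest.{FunSpec, Matchers}

// Assumed shape: a shared base class so each test declares one parent.
abstract class BasicScalaTest extends FunSpec with Matchers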
29. Things to note
• Extract anonymous functions so they are testable
• What can be unit tested?
• Executor and driver code not related to Spark
• UDFs (see the sketch below)
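For the UDF case, one way to keep the logic unit-testable is to hold it in a plain function and wrap it in a UDF only at the DataFrame boundary. A hedged sketch (the names are illustrative, not from the deck):

import org.apache.spark.sql.functions.udf

object Udfs {
  // Plain function holding the logic: testable with no Spark at all.
  val normalizeWord: String => String = word => word.trim.toLowerCase

  // Thin wrapper for DataFrame code; nothing Spark-specific to test here.
  val normalizeWordUdf = udf(normalizeWord)
}

A unit test then simply asserts that normalizeWord("  Ala ") returns "ala", without creating a session.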
31. Production code vs test code
Production code
• distributed mode
Test code
• local mode
32. Production code vs test code
Production code
• distributed mode
• RDD from storage
Test code
• local mode
• RDD from resources/memory
33. Production code vs test code
Production code
• distributed mode
• RDD from storage
• Evaluate transformations on RDD
or DStream API.
Test code
• local mode
• RDD from resources/memory
• Evaluate transformations on RDD
or DStream API.
34. Production code vs test code
Production code
• distributed mode
• RDD from storage
• Evaluate transformations on RDD
or DStream API.
• Store outcomes
Test code
• local mode
• RDD from resources/memory
• Evaluate transformations on RDD
or DStream API.
• Assert outcomes
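In practice this split means the transformation itself takes and returns RDDs with no I/O inside, while the production driver wires in storage and a test wires in in-memory data. A minimal sketch of the idea (method names are illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Pipeline {
  // The logic under test: a pure RDD-to-RDD transformation, no I/O.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

  // Production entry point: RDD from storage in, outcomes stored.
  def runProduction(sc: SparkContext, in: String, out: String): Unit =
    countWords(sc.textFile(in)).saveAsTextFile(out)
}

A test instead calls countWords(sc.parallelize(...)) and asserts on the collected result.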
36. What to test in integration tests?
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
.flatMap((line: String) => line.split(" "))
.map((word: String) => (word, 1))
.reduceByKey((occurrence1: Int, occurrence2: Int) => {
occurrence1 + occurrence2
}).saveAsTextFile("/tmp/output")
40. class S01_IntegrationTest extends SparkSessionBase {
  it("should count word occurrences in all lines") {
    Given("RDD of sentences")
    val linesRdd: RDD[String] = ss.sparkContext.parallelize(
      List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))
    When("extract and count words")
    val wordsCountRdd: RDD[(String, Int)] = WordCount.extractAndCountWords(linesRdd)
    val actual: Map[String, Int] = wordsCountRdd.collectAsMap().toMap
    Then("words should be counted")
    val expected = Map(
      "Ala" -> 2,
      "ma" -> 2,
      "kota" -> 1,
      "Bolek" -> 1,
      "i" -> 1,
      "Lolek" -> 1,
      "psa" -> 1)
    actual should be(expected)
  }
}
42. class SparkSessionBase extends FunSpec with BeforeAndAfterAll with Matchers with GivenWhenThen {
  var ss: SparkSession = _
  override def beforeAll() {
    val conf = new SparkConf()
      .setMaster("local[4]")
    ss = SparkSession.builder()
      .appName("TestApp" + System.currentTimeMillis())
      .config(conf)
      .getOrCreate()
  }
  override def afterAll() {
    ss.stop()
    ss = null
  }
}
43. class S01_IntegrationTest extends SparkSessionBase {
  it("should count word occurrences in all lines") {
    Given("RDD of sentences")
    val linesRdd: RDD[String] = ss.sparkContext.parallelize(
      List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))
    When("extract and count words")
    val wordsCountRdd: RDD[(String, Int)] = WordCount.extractAndCountWords(linesRdd)
    val actual: Map[String, Int] = wordsCountRdd.collectAsMap().toMap
    Then("words should be counted")
    val expected = Map(
      "Ala" -> 2,
      "ma" -> 2,
      "kota" -> 1,
      "Bolek" -> 1,
      "i" -> 1,
      "Lolek" -> 1,
      "psa" -> 1)
    actual should equal(expected)
  }
}
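extractAndCountWords is, presumably, the whole chain from the earlier slides moved into the helper; extending the sketch of WordCount from above under that assumption:

import org.apache.spark.rdd.RDD

object WordCount {
  def extractWords(line: String): Array[String] = line.split(" ")

  // Assumed implementation: the flatMap/map/reduceByKey chain,
  // extracted so the integration tests can call it directly.
  def extractAndCountWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(extractWords)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}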
45. it("should count word occurrences in all lines") {
  Given("few lines of sentences")
  val schema = StructType(List(
    StructField("line", StringType, true)))
  val linesDf: DataFrame = ss.read.schema(schema).json(getResourcePath("/text.json"))
  When("extract and count words")
  val wordsCountDf: DataFrame = WordCount.extractFilterAndCountWords(linesDf)
  val wordCount: Array[Row] = wordsCountDf.collect()
  Then("filtered words should be counted")
  val actualWordCount = wordCount
    .map((row: Row) => (row.getAs[String]("word"), row.getAs[Long]("count")))
    .toMap
  val expectedWordCount = Map("Ala" -> 2, "Bolek" -> 1)
  actualWordCount should be(expectedWordCount)
}
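The DataFrame variant under test is not shown either; below is a sketch that would satisfy the assertion above, assuming the "filter" keeps exactly the words the test titles mention ("Ala" and "Bolek"):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, split}

object WordCount {
  // Assumed implementation: split each line into words, keep only the
  // words of interest, and count occurrences per word.
  def extractFilterAndCountWords(lines: DataFrame): DataFrame =
    lines
      .select(explode(split(col("line"), " ")).as("word"))
      .filter(col("word").isin("Ala", "Bolek"))
      .groupBy("word")
      .count()
}

The groupBy("word").count() step yields the "word" and "count" columns that the test reads back with getAs.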
48. it("should return total count of Ala and Bolek words in all lines of text") {
Given("few sentences")
implicit val lineEncoder = product[Line]
val lines = List(
Line(text = "Ala ma kota"),
Line(text = "Bolek i Lolek"),
Line(text = "Ala ma psa"))
val linesDs: Dataset[Line] = ss.createDataset(lines)
When("extract and count words")
val wordsCountDs: Dataset[WordCount] = WordsCount
.extractFilterAndCountWordsDataset(linesDs)
val actualWordCount: Array[WordCount] = wordsCountDs.collect()
Then("filtered words should be counted")
val expectedWordCount = Array(WordCount("Ala", 2), WordCount("Bolek", 1))
actualWordCount should contain theSameElementsAs expectedWordCount
}
49. it("should return total count of Ala and Bolek words in all lines of text") {
  import ss.implicits._
  Given("few sentences")
  implicit val lineEncoder = product[Line]
  val linesDs: Dataset[Line] = List(
    Line(text = "Ala ma kota"),
    Line(text = "Bolek i Lolek"),
    Line(text = "Ala ma psa")).toDS()
  When("extract and count words")
  val wordsCountDs: Dataset[WordCount] = WordsCount
    .extractFilterAndCountWordsDataset(linesDs)
  val actualWordCount: Array[WordCount] = wordsCountDs.collect()
  Then("filtered words should be counted")
  val expectedWordCount = Array(WordCount("Ala", 2), WordCount("Bolek", 1))
  actualWordCount should contain theSameElementsAs expectedWordCount
}
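The typed variant and its case classes are likewise only used, never defined; a sketch consistent with both tests above (the filter is again assumed, and the helper keeps the deck's name WordsCount, presumably to avoid clashing with the WordCount case class):

import org.apache.spark.sql.Dataset

// Case classes assumed by the tests; note they must live at the top
// level, not inside a test class body (see the pitfalls slide below).
case class Line(text: String)
case class WordCount(word: String, count: Long)

object WordsCount {
  def extractFilterAndCountWordsDataset(lines: Dataset[Line]): Dataset[WordCount] = {
    import lines.sparkSession.implicits._
    lines
      .flatMap(_.text.split(" "))
      .filter(word => word == "Ala" || word == "Bolek")
      .groupByKey(identity)
      .count()
      .map { case (word, count) => WordCount(word, count) }
  }
}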
50. Things to note
• What can be tested in integration tests?
• Single transformation on Spark abstractions
• Chain of transformations
• Integration with external services e.g. Kafka, HDFS, YARN
• Embedded instances
• Docker environment
• Prefer Datasets over RDDs or DataFrames
52. spark-fast-tests
class S04_IntegrationDatasetFastTest extends SparkSessionBase with DatasetComparer {
  it("should return total count of Ala and Bolek words in all lines of text") {
    Given("few lines of sentences")
    implicit val lineEncoder = product[Line]
    implicit val wordEncoder = product[WordCount]
    val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
    val linesDs: Dataset[Line] = ss.createDataset(lines)
    When("extract and count words")
    val wordsCountDs: Dataset[WordCount] = WordsCount
      .extractFilterAndCountWordsDataset(linesDs)
    Then("filtered words should be counted")
    val expectedDs = ss.createDataset(Seq(WordCount("Ala", 2), WordCount("Bolek", 1)))
    assertSmallDatasetEquality(wordsCountDs, expectedDs, orderedComparison = false)
  }
}
54. Spark Testing Base
class S06_01_IntegrationDatasetSparkTestingBaseTest extends FunSpec with DatasetSuiteBase with GivenWhenThen {
  it("counting word occurrences on a few lines of text should return the count of Ala and Bolek words in this text") {
    Given("few lines of sentences")
    implicit val lineEncoder = product[Line]
    implicit val wordEncoder = product[WordCount]
    val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
    val linesDs: Dataset[Line] = spark.createDataset(lines)
    When("extract and count words")
    val wordsCountDs: Dataset[WordCount] = WordsCount.extractFilterAndCountWordsDataset(linesDs)
    Then("filtered words should be counted")
    val expectedDs: Dataset[WordCount] = spark.createDataset(Seq(WordCount("Bolek", 1), WordCount("Ala", 2)))
    assertDatasetEquals(expected = expectedDs, result = wordsCountDs)
  }
}
55. Spark Testing Base – not so nice failure messages
• Different length:
1 did not equal 2 Length not Equal
ScalaTestFailureLocation: com.holdenkarau.spark.testing.TestSuite$class at
• Different order of elements:
Tuple2;((0,(WordCount(Ala,2),WordCount(Bolek,1))), (1,(WordCount(Bolek,1),WordCount(Ala,2)))) was not empty
• Different values:
Tuple2;((0,(WordCount(Bole,1),WordCount(Bolek,1)))) was not empty
61. Streaming – Spark Testing Base
class S06_02_StreamingTest_SparkTestingBase extends FunSuite with StreamingSuiteBase {
test("count words") {
val input = List(List("a b"))
val expected = List(List(("a", 1), ("b", 1)))
testOperation[String, (String, Int)](input, count _, expected, ordered = false)
}
// This is the sample operation we are testing
def count(lines: DStream[String]): DStream[(String, Int)] = {
lines.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
}
}
62. How to design easily testable Spark code?
• Extract functions so they are reusable and testable
• A single transformation should do one thing
• Compose transformations using the "transform" function (see the sketch below)
• Prefer column-based functions over UDFs; in order of preference:
• column-based functions
• Dataset operators
• UDFs
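To illustrate the composition advice, here is a hedged sketch (illustrative names) that chains small, single-purpose transformations with Dataset's transform, using column-based functions instead of UDFs:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, lower, split, trim}

object Transformations {
  // Each step does one thing and can be tested in isolation.
  def extractWords(df: DataFrame): DataFrame =
    df.select(explode(split(col("line"), " ")).as("word"))

  // A column-based function rather than a UDF, so Catalyst can optimise it.
  def normalizeWords(df: DataFrame): DataFrame =
    df.withColumn("word", lower(trim(col("word"))))

  def countWords(df: DataFrame): DataFrame =
    df.groupBy("word").count()
}

// The composition then reads as a linear pipeline:
// linesDf
//   .transform(Transformations.extractWords)
//   .transform(Transformations.normalizeWords)
//   .transform(Transformations.countWords)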
69. Pitfalls to look out for
• You cannot refer to one RDD inside another RDD
• Spark processes a batch of data, not a single message or domain entity
• Case classes defined in a test class body throw a SerializationException
• Spark reads JSON according to the http://jsonlines.org/ specification (see the example below)
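Concretely, the JSON Lines point means an input file must contain one complete JSON object per line, not a single pretty-printed document. The text.json resource used in the DataFrame test would presumably look like this:

{"line": "Ala ma kota"}
{"line": "Bolek i Lolek"}
{"line": "Ala ma psa"}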
Driver program - runs the user's main function and executes various parallel operations on a cluster
RDDs - collections of elements partitioned across the nodes of the cluster that can be operated on in parallel
Worker - manages resources on a cluster node
Executor - JVM process which stores data and executes tasks
Tasks - units of work that execute RDD operations