Testing batch and streaming
Spark applications
@lukaszgawron
Software Engineer @PerformGroup
Overview
• Why run an application outside of a cluster?
• Spark in a nutshell
• Unit and integration tests
• Tools
• Spark Streaming integration tests
• Best practices and pitfalls
Why run an application outside of a cluster?
Why do we want to test?
• safety / regression
• fast feedback
• communication
• best possible design
Spark in a nutshell
Batch and streaming
Example – word count
WordCount maps (extracts) words from an input source and reduces
(summarizes) the results, returning a count of each word.
object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Quality Excites")
    val sc = new SparkContext(conf)
    val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
    val wordsRDD: RDD[String] = sc.parallelize(words)
    wordsRDD
      .flatMap((line: String) => line.split(" "))
      .map((word: String) => (word, 1))
      .reduceByKey((occurrence1: Int, occurrence2: Int) => {
        occurrence1 + occurrence2
      }).saveAsTextFile("/tmp/output")
  }
}
The same job with the anonymous function extracted into a testable WordsCount helper:
object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Quality Excites")
    val sc = new SparkContext(conf)
    val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
    val wordsRDD: RDD[String] = sc.parallelize(words)
    wordsRDD
      .flatMap(WordsCount.extractWords)
      .map((word: String) => (word, 1))
      .reduceByKey((occurrence1: Int, occurrence2: Int) => {
        occurrence1 + occurrence2
      }).saveAsTextFile("/tmp/output")
  }
}
object WordsCount {
  def extractWords(line: String): Array[String] = {
    line.split(" ")
  }
}
Example unit test
class S00_UnitTest extends FunSpec with Matchers {
  it("should split a sentence into words") {
    val line = "Ala ma kota"
    val words: Array[String] = WordsCount.extractWords(line = line)
    val expected = Array("Ala", "ma", "kota")
    words should be (expected)
  }
}
Example unit test
class BasicScalaTest extends FunSpec with Matchers {
}
Example unit test
class S00_UnitTest extends BasicScalaTest {
  it("should split a sentence into words") {
    val line = "Ala ma kota"
    val words: Array[String] = WordsCount.extractWords(line = line)
    val expected = Array("Ala", "ma", "kota")
    words should be (expected)
  }
}
Things to note
• Extract anonymous functions so they are testable
• What can be unit tested?
• Executor and driver code not related to Spark
• UDF functions
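For instance, the plain Scala function behind a UDF can be unit tested without any SparkSession; a minimal sketch below, with a hypothetical Greetings helper that is not from the slides:
import org.scalatest.{FunSpec, Matchers}

// Hypothetical helper: a plain function that could later be wrapped in a UDF.
object Greetings {
  val greet: String => String = name => s"Hello, $name!"
}

// Plain ScalaTest - no SparkSession or SparkContext needed.
class GreetFunctionUnitTest extends FunSpec with Matchers {
  it("greets by name") {
    Greetings.greet("Quality Excites") should be("Hello, Quality Excites!")
  }
}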
Production code vs test code
Production code
• distributed mode
• RDD from storage
• Evaluate transformations on RDD or DStream API
• Store outcomes
Test code
• local mode
• RDD from resources/memory
• Evaluate transformations on RDD or DStream API
• Assert outcomes
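To make the comparison concrete, a minimal sketch assuming the SparkSessionBase fixture and the WordsCount.extractAndCountWords function shown later in this deck: the test builds the RDD in memory instead of reading it from storage, and asserts the outcome instead of storing it.
import org.apache.spark.rdd.RDD

class ProductionVsTestSketch extends SparkSessionBase {
  it("runs the production transformation on in-memory data") {
    // RDD from memory instead of storage
    val lines: RDD[String] = ss.sparkContext.parallelize(List("Ala ma kota", "Ala ma psa"))
    // Same transformation the production job applies between read and write
    val counts = WordsCount.extractAndCountWords(lines).collectAsMap()
    // Assert the outcome instead of calling saveAsTextFile
    counts("Ala") should be(2)
  }
}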
What to test in integration tests?
val words = List("Ala ma kota", "Bolek i Lolek", "Ala ma psa")
val wordsRDD: RDD[String] = sc.parallelize(words)
wordsRDD
  .flatMap((line: String) => line.split(" "))
  .map((word: String) => (word, 1))
  .reduceByKey((occurrence1: Int, occurrence2: Int) => {
    occurrence1 + occurrence2
  }).saveAsTextFile("/tmp/output")
Integration test
def extractAndCountWords(wordsRDD: RDD[String]): RDD[(String, Int)] = {
  wordsRDD
    .flatMap(WordsCount.extractWords)
    .map((word: String) => (word, 1))
    .reduceByKey((occurrence1: Int, occurrence2: Int) => {
      occurrence1 + occurrence2
    })
}
Integration test - RDD
class S01_IntegrationTest extends SparkSessionBase {
  it("should count word occurrences in all lines") {
    Given("RDD of sentences")
    val linesRdd: RDD[String] = ss.sparkContext.parallelize(
      List("Ala ma kota", "Bolek i Lolek", "Ala ma psa"))
    When("extract and count words")
    val wordsCountRdd: RDD[(String, Int)] = WordsCount.extractAndCountWords(linesRdd)
    val actual: Map[String, Int] = wordsCountRdd.collectAsMap()
    Then("words should be counted")
    val expected = Map(
      "Ala" -> 2,
      "ma" -> 2,
      "kota" -> 1,
      ................
    )
    actual should be(expected)
The shared SparkSession fixture used by the integration tests:
class SparkSessionBase extends FunSpec with BeforeAndAfterAll with Matchers with GivenWhenThen {
  var ss: SparkSession = _
  override def beforeAll() {
    val conf = new SparkConf()
      .setMaster("local[4]")
    ss = SparkSession.builder()
      .appName("TestApp" + System.currentTimeMillis())
      .config(conf)
      .getOrCreate()
  }
  override def afterAll() {
    ss.stop()
    ss = null
  }
}
Integration test – DataFrame
def extractFilterAndCountWords(wordsDf: DataFrame): DataFrame = {
  val words: Column = explode(split(col("line"), " ")).as("word")
  wordsDf
    .select(words)
    .where(col("word").equalTo("Ala").or(col("word").equalTo("Bolek")))
    .groupBy("word")
    .count()
}
it("should count words occurence in all lines") {
Given("few lines of sentences")
val schema = StructType(List(
StructField("line", StringType, true)
))
val linesDf: DataFrame = ss.read.schema(schema).json(getResourcePath("/text.json"))
When("extract and count words")
val wordsCountDf: DataFrame = WordCount.extractFilterAndCountWords(linesDf)
val wordCount: Array[Row] = wordsCountDf.collect()
Then("filtered words should be counted")
val actualWordCount = wordCount
.map((row: Row) =>Tuple2(row.getAs[String]("word"), row.getAs[Long]("count")))
.toMap
val expectedWordCount = Map("Ala" -> 2,"Bolek" -> 1)
actualWordCount should be(expectedWordCount)
}
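For reference, the /text.json resource read above has to follow the JSON Lines format (one JSON object per line); an assumed example matching the expected counts, not taken from the slides:
{"line": "Ala ma kota"}
{"line": "Bolek i Lolek"}
{"line": "Ala ma psa"}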
it("should count words occurence in all lines") {
Given("few lines of sentences")
val schema = StructType(List(
StructField("line", StringType, true)
))
val linesDf: DataFrame = ss.read.schema(schema).json(getResourcePath("/text.json"))
When("extract and count words")
val wordsCountDf: DataFrame = WordCount.extractFilterAndCountWords(linesDf)
val wordCount: Array[Row] = wordsCountDf.collect()
Then("filtered words should be counted")
val actualWordCount = wordCount
.map((row: Row) =>Tuple2(row.getAs[String]("word"), row.getAs[Long]("count")))
.toMap
val expectedWordCount = Map("Ala" -> 2,"Bolek" -> 1)
actualWordCount should be(expectedWordCount)
}
Integration test – Dataset
def extractFilterAndCountWordsDataset(wordsDs: Dataset[Line]): Dataset[WordCount] = {
  import wordsDs.sparkSession.implicits._
  wordsDs
    .flatMap((line: Line) => line.text.split(" "))
    .filter((word: String) => word == "Ala" || word == "Bolek")
    .withColumnRenamed("value", "word")
    .groupBy(col("word"))
    .agg(count("word").as("count"))
    .as[WordCount]
}
it("should return total count of Ala and Bolek words in all lines of text") {
Given("few sentences")
implicit val lineEncoder = product[Line]
val lines = List(
Line(text = "Ala ma kota"),
Line(text = "Bolek i Lolek"),
Line(text = "Ala ma psa"))
val linesDs: Dataset[Line] = ss.createDataset(lines)
When("extract and count words")
val wordsCountDs: Dataset[WordCount] = WordsCount
.extractFilterAndCountWordsDataset(linesDs)
val actualWordCount: Array[WordCount] = wordsCountDs.collect()
Then("filtered words should be counted")
val expectedWordCount = Array(WordCount("Ala", 2),WordCount("Bolek", 1))
actualWordCount should contain theSameElementsAs expectedWordCount
}
it("should return total count of Ala and Bolek words in all lines of text") {
import spark.implicits._
Given("few sentences")
implicit val lineEncoder = product[Line]
val linesDs: Dataset[Lines] = List(
Line(text = "Ala ma kota"),
Line(text = "Bolek i Lolek"),
Line(text = "Ala ma psa")).toDS()
When("extract and count words")
val wordsCountDs: Dataset[WordCount] = WordsCount
.extractFilterAndCountWordsDataset(linesDs)
val actualWordCount: Array[WordCount] = wordsCountDs.collect()
Then("filtered words should be counted")
val expectedWordCount = Array(WordCount("Ala", 2),WordCount("Bolek", 1))
actualWordCount should contain theSameElementsAs expectedWordCount
}
Things to note
• What can be tested in integration tests?
• A single transformation on Spark abstractions
• A chain of transformations
• Integration with external services, e.g. Kafka, HDFS, YARN
• Embedded instances
• Docker environment
• Prefer Datasets over RDDs or DataFrames
Tools
spark-fast-tests
class S04_IntegrationDatasetFastTest extends SparkSessionBase with DatasetComparer {
  it("should return total count of Ala and Bolek words in all lines of text") {
    Given("few lines of sentences")
    implicit val lineEncoder = product[Line]
    implicit val wordEncoder = product[WordCount]
    val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
    val linesDs: Dataset[Line] = ss.createDataset(lines)
    When("extract and count words")
    val wordsCountDs: Dataset[WordCount] = WordsCount
      .extractFilterAndCountWordsDataset(linesDs)
    Then("filtered words should be counted")
    val expectedDs = ss.createDataset(Array(WordCount("Ala", 2), WordCount("Bolek", 1)))
    assertSmallDatasetEquality(wordsCountDs, expectedDs, orderedComparison = false)
  }
}
spark-fast-tests – nice failure messages
Different values
Spark Testing Base
class S06_01_IntegrationDatasetSparkTestingBaseTest extends FunSpec with DatasetSuiteBase with GivenWhenThen {
  it("counting word occurrences on a few lines of text should return counts of Ala and Bolek words in this text") {
    Given("few lines of sentences")
    implicit val lineEncoder = product[Line]
    implicit val wordEncoder = product[WordCount]
    val lines = List(Line(text = "Ala ma kota"), Line(text = "Bolek i Lolek"), Line(text = "Ala ma psa"))
    val linesDs: Dataset[Line] = spark.createDataset(lines)
    When("extract and count words")
    val wordsCountDs: Dataset[WordCount] = WordsCount.extractFilterAndCountWordsDataset(linesDs)
    Then("filtered words should be counted")
    val expectedDs: Dataset[WordCount] = spark.createDataset(Seq(WordCount("Bolek", 1), WordCount("Ala", 2)))
    assertDatasetEquals(expected = expectedDs, result = wordsCountDs)
  }
}
Spark Testing Base – not so nice failure messages
• Different length
1 did not equal 2 Length not EqualScalaTestFailureLocation: com.holdenkarau.spark.testing.TestSuite$class at
• Different order of elements
Tuple2;((0,(WordCount(Ala,2),WordCount(Bolek,1))), (1,(WordCount(Bolek,1),WordCount(Ala,2)))) was not empty
• Different values
Tuple2;((0,(WordCount(Bole,1),WordCount(Bolek,1)))) was not empty
spark-fast-tests vs spark-testing-base
Other tools
• https://github.com/dwestheide/kontextfrei
• https://github.com/MrPowers/spark-daria
• https://github.com/hammerlab/spark-tests
Spark streaming - infinite flow of data
DStream
DStream transformations
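For context, a minimal DStream word-count sketch (an assumed stand-alone example, not taken from the slides), reading from a local socket source such as one started with nc -lk 9999; the transformations mirror the batch job shown earlier:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("StreamingWordCount")
    // Micro-batches of 5 seconds; each batch is processed with RDD-like transformations.
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}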
Streaming – spark testing base
class S06_02_StreamingTest_SparkTestingBase extends FunSuite with StreamingSuiteBase {
  test("count words") {
    val input = List(List("a b"))
    val expected = List(List(("a", 1), ("b", 1)))
    testOperation[String, (String, Int)](input, count _, expected, ordered = false)
  }
  // This is the sample operation we are testing
  def count(lines: DStream[String]): DStream[(String, Int)] = {
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
  }
}
How to design easily testable Spark code?
• Extract functions so they are reusable and testable
• A single transformation should do one thing
• Compose transformations using the "transform" function
• Prefer Column-based functions over UDFs
• Column-based functions
• Dataset operators
• UDF functions
Input DataFrame:
name
Quality Excites

Output DataFrame:
name             greeting
Quality Excites  Hello!!
Column-based function
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
object HelloWorld {
  def withGreeting()(df: DataFrame): DataFrame = {
    df.withColumn("greeting", lit("Hello!!"))
  }
}
//def lit(literal: Any): Column
it("appends a greeting column to a Dataframe") {
Given("Source dataframe")
val sourceDF = Seq(
("Quality Excites")
).toDF("name")
When("adding greeting column")
val actualDF = sourceDF
.transform(HelloWorld.withGreeting())
Then("new data frame contains column greeting")
val expectedSchema = List(StructField("name", StringType, true),StructField("greeting",
StringType, false))
val expectedData = Seq(Row("Quality Excites", ”Hello!!"))
val expectedDF =
ss.createDataFrame(ss.sparkContext.parallelize(expectedData),StructType(expectedSchema))
assertSmallDatasetEquality(actualDF, expectedDF, orderedComparison = false)
}
it("appends a greeting column to a Dataframe") {
Given("Source dataframe")
val sourceDF = Seq(
("Quality Excites")
).toDF("name")
When("adding greeting column")
val actualDF = sourceDF
.transform(HelloWorld.withGreeting())
.transform(HelloWorld.withGreetingUdf())
object HelloWorld {
  def withGreeting()(df: DataFrame): DataFrame = {
    df.withColumn("greeting", lit("Hello!!"))
  }
  val litFunction: () => String = () => "Hello!!"
  val udfLit = udf(litFunction)
  def withGreetingUdf()(df: DataFrame): DataFrame = {
    df.withColumn("greetingUdf", udfLit())
  }
}
Pitfalls you should look out for
• You cannot refer to one RDD inside another RDD
• You process a batch of data, not a single message or domain entity
• Case classes defined in a test class body throw a SerializationException
• Spark reads JSON based on the http://jsonlines.org/ specification
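A minimal sketch of the case class pitfall (class and test names hypothetical): a case class defined inside the test class becomes an inner class that drags the non-serializable test instance into Spark's closures, so keep such classes at the top level of the file, as the Line and WordCount classes in this deck presumably are.
import org.apache.spark.sql.Encoders

// OK: defined at the top level of the test file, serializable on its own.
case class Line(text: String)

class CaseClassPitfallTest extends SparkSessionBase {
  // NOT OK: a case class defined here would be an inner class of the test,
  // and tasks that use it fail to serialize on the executors.
  // case class InnerLine(text: String)

  it("works with top-level case classes") {
    implicit val lineEncoder = Encoders.product[Line]
    val ds = ss.createDataset(Seq(Line("Ala ma kota")))
    ds.count() should be(1L)
  }
}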
Costly false positive
Q&A
Thank you
References
• https://databricks.com/session/mastering-spark-unit-testing
• https://medium.com/@mrpowers/designing-easily-testable-spark-code-df0755ef00a4
• https://medium.com/@mrpowers/testing-spark-applications-8c590d3215fa
• http://shop.oreilly.com/product/0636920046967.do
• https://spark.apache.org/docs/latest/streaming-programming-guide.html
• https://spark.apache.org/docs/latest/sql-programming-guide.html
• https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html


Editor's Notes

  1. How many people have used Spark? How many have tested Spark? How many came out of curiosity? Encourage questions as they come up.
  2. http://bigdata-madesimple.com/wp-content/uploads/2015/11/overview-of-spark-components.png
  3. Driver program - runs the user's main function and executes various parallel operations on a cluster. RDDs - collections of elements partitioned across the nodes of the cluster that can be operated on in parallel. Worker - manages resources on a cluster node. Executor - JVM process which stores data and executes tasks. Tasks - execute RDD operations.
  4. What is such a sliding sum called?
  5. ScalaTest
  6. ScalaTest
  7. Change the order of the JSON slides, because JSON is shown before any story related to it.
  8. Change the order of the JSON slides, because JSON is shown before any story related to it.