
Modular Apache Spark: Transform Your Code in Pieces


Divide and you will conquer Apache Spark. It is quite common to find a papyrus-like script that initializes Spark, reads the paths, executes all the logic and writes the result. We have even found scripts where all the Spark transformations live in a single method with tons of lines. That code is difficult to test, to maintain and to read; in other words, it is bad code. We built a set of tools and libraries that lets developers build their pipelines by joining pieces: Readers, Writers, Transformers, Aliases, etc. It also comes with enriched Spark suites based on spark-testing-base from Holden Karau. Recently we started using junit4git (github.com/rpau/junit4git) in our tests, which lets us execute only the Spark tests that matter by skipping the ones not affected by the latest code changes. This translates into faster builds and fewer coffees. By letting developers define each piece on its own, we make it possible to test small pieces before assembling the full set, to reuse code across multiple pipelines, and to speed up development while improving code quality. The power of the "transform" method combined with currying creates a powerful tool for fragmenting all the Spark logic. This talk is aimed at developers who are new to the Spark world, and shows how developing iteration by iteration, in small steps, can help them produce great code with less effort.

Published in: Data & Analytics

Modular Apache Spark: Transform Your Code in Pieces

  1. 1. Albert Franzi (@FranziCros), Alpha Health. Modular Apache Spark: Transform Your Code into Pieces
  2. 2. 2 Slides available at: http://bit.ly/SparkAI2019-afranzi
  3. 3. About me 3 [timeline: 2013 2014 2015 2017 2018 2019]
  4. 4. Modular Apache Spark: Transform Your Code into Pieces 4
  5. 5. 5 We will learn: - How we reduced test execution time by skipping unaffected tests -> fewer coffees - How we simplified our Spark code by modularizing it - How we increased the test coverage of our Spark code by using the spark-testing-base library provided by Holden Karau
  6. 6. 6 We will learn: - How we reduced test execution time by skipping unaffected tests -> fewer coffees - How we simplified our Spark code by modularizing it - How we increased the test coverage of our Spark code by using the spark-testing-base library provided by Holden Karau
  7. 7. 7 Have you ever played with duplicated code across your Spark jobs? @ The Shining 1980 Have you ever experienced the joy of code-reviewing a never-ending stream of Spark chaos?
  8. 8. 8 Please, don’t ever play with duplicated code! @ The Shining 1980
  9. 9. Fragment the Spark Job 9 Spark Readers Transformation Joins Aliases Formatters Spark Writers Task Context ...
  10. 10. Readers / Writers 10 Spark Readers Spark Writers ● Enforce schemas ● Use schemas to read only the fields you are going to use ● Provide a Reader per Dataset & attach its sources to it ● Share schemas & sources between Readers & Writers ● GDPR compliant by design (see the plain-Spark sketch below)
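A plain-Spark sketch of what the first two bullets mean in practice (the ReaderBuilder shown on the next slides wraps this up); the bucket path and field names are illustrative placeholders:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

    val spark: SparkSession = SparkSession.builder().getOrCreate()

    // Declare only the fields the job actually needs...
    val userBehaviourSchema = StructType(Seq(
      StructField("user_id", StringType),
      StructField("event_type", StringType),
      StructField("event_time", TimestampType)
    ))

    // ...and enforce that schema on read, so only those columns are materialized
    // from Parquet and any drift in the source layout fails fast.
    val df: DataFrame = spark.read
      .schema(userBehaviourSchema)
      .parquet("s3://<bucket>/user_behaviour/")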
  11. 11. val userBehaviourSchema: StructType = ??? val userBehaviourPath = Path("s3://<bucket>/user_behaviour/year=2018/month=10/day=03/hour=12/gen=27/") val userBehaviourReader = ReaderBuilder(PARQUET) .withSchema(userBehaviourSchema) .withPath(userBehaviourPath) .buildReader() val df: DataFrame = userBehaviourReader.read() GDPR Readers 11 Spark Readers
  12. 12. Readers 12 Spark Readers val userBehaviourSchema: StructType = ??? // Path structure - s3://<bucket>/user_behaviour/[year]/[month]/[day]/[hour]/[gen]/ val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/") val startDate: ZonedDateTime = ZonedDateTime.now(ZoneOffset.UTC) val halfDay: Duration = Duration.ofHours(12) val userBehaviourPaths: Seq[Path] = PathBuilder .latestGenHourlyPaths(userBehaviourBasePath, startDate, halfDay) val userBehaviourReader = ReaderBuilder(PARQUET) .withSchema(userBehaviourSchema) .withPath(userBehaviourPaths: _*) .buildReader() val df: DataFrame = userBehaviourReader.read()
  13. 13. Readers 13 val userBehaviourSchema: StructType = ??? val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/") val startDate: ZonedDateTime = ZonedDateTime.now(ZoneOffset.UTC) val halfDay: Duration = Duration.ofHours(12) val userBehaviourReader = ReaderBuilder(PARQUET) .withSchema(userBehaviourSchema) .withHourlyPathBuilder(userBehaviourBasePath, startDate, halfDay) .buildReader() val df: DataFrame = userBehaviourReader.read() Spark Readers
  14. 14. Readers 14 val df: DataFrame = UserBehaviourReader.read(startDate, halfDay) Spark Readers
  15. 15. Transforms 15 Transformation def transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U] Dataset[T] ⇒ magic ⇒ Dataset[U]
  16. 16. Transforms 16 Transformation def withGreeting(df: DataFrame): DataFrame = { df.withColumn("greeting", lit("hello world")) } def extractFromJson(colName: String, outputColName: String, jsonSchema: StructType)(df: DataFrame): DataFrame = { df.withColumn(outputColName, from_json(col(colName), jsonSchema)) }
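A short usage sketch of how these functions plug into Spark's built-in Dataset.transform: parameterless transformations are passed as-is, while extractFromJson is curried so its configuration is supplied first and the DataFrame last. The rawDF value and the payload column names are placeholders:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    val payloadSchema: StructType = ???   // schema of the JSON payload column
    val rawDF: DataFrame = ???            // e.g. the output of a Reader

    val enrichedDF: DataFrame = rawDF
      .transform(withGreeting)                                                // DataFrame => DataFrame
      .transform(extractFromJson("payload", "payload_parsed", payloadSchema)) // curried: config first, DataFrame last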
  17. 17. Transforms 17 Transformation def onlyClassifiedAds(df: DataFrame): DataFrame = { df.filter(col("event_type") === "View") .filter(col("object_type") === "ClassifiedAd") } def dropDuplicates(df: DataFrame): DataFrame = { df.dropDuplicates() } def cleanedCity(df: DataFrame): DataFrame = { df.withColumn("city", getCityUdf(col("object.location.address"))) } val cleanupTransformations: Seq[DataFrame => DataFrame] = Seq( dropDuplicates, cleanedCity, onlyClassifiedAds ) val df: DataFrame = UserBehaviourReader.read(startDate, halfDay) val classifiedAdsDF = df.transforms(cleanupTransformations: _*)
  18. 18. Transforms 18 Transformation val cleanupTransformations: Seq[DataFrame => DataFrame] = Seq( dropDuplicates, cleanedCity, onlyClassifiedAds ) val df: DataFrame = UserBehaviourReader.read(startDate, halfDay) val classifiedAdsDF = df.transforms(cleanupTransformations: _*) “As a data consumer, I only need to pick which transformations I would like to apply, instead of coding them from scratch.” “It’s like cooking: engineers provide manufactured ingredients (transformations) and Data Scientists use the required ones for a successful recipe.”
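Note that transforms is not part of the Spark API; a minimal sketch of how such a helper could be exposed, assuming a hypothetical DataFrameSyntax object in the project's utility library:

    import org.apache.spark.sql.DataFrame

    object DataFrameSyntax {
      implicit class DataFrameOps(df: DataFrame) {
        // Apply the transformations left to right, so df.transforms(a, b, c)
        // behaves like df.transform(a).transform(b).transform(c).
        def transforms(ts: (DataFrame => DataFrame)*): DataFrame =
          ts.foldLeft(df)((acc, t) => t(acc))
      }
    }

With import DataFrameSyntax._ in scope, the call sites above compile as written.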
  19. 19. Transforms - Links of Interest “DataFrame transformations can be defined with arguments so they don’t make assumptions about the schema of the underlying DataFrame.” - by Matthew Powers. bit.ly/Spark-ChainingTransformations bit.ly/Spark-SchemaIndependentTransformations 19 Transformation github.com/MrPowers/spark-daria “Spark helper methods to maximize developer productivity.”
  20. 20. 20 We will learn: - How we reduced test execution time by skipping unaffected tests -> fewer coffees - How we simplified our Spark code by modularizing it - How we increased the test coverage of our Spark code by using the spark-testing-base library provided by Holden Karau
  21. 21. 21 Have you ever put untested Spark jobs into production? “Mars Climate Orbiter destroyed because of a Metric System Mixup (1999)”
  22. 22. Testing with Holden Karau 22 github.com/holdenk/spark-testing-base “Base classes to use when writing tests with Spark.” ● Share Spark Context between tests ● Provide methods to make tests easier ○ Fixture Readers ○ Json to DF converters ○ Extra validators github.com/MrPowers/spark-fast-tests “An alternative to spark-testing-base to run tests in parallel without restarting Spark Session after each test file.”
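A minimal sketch of a test built directly on spark-testing-base, assuming the withGreeting transformation from slide 16 is on the classpath; everything else uses the library as documented:

    package com.alpha.data.test

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.scalatest.FunSuite

    class WithGreetingSpec extends FunSuite with DataFrameSuiteBase {

      private val inputSchema = StructType(Seq(StructField("name", StringType)))
      private val outputSchema = inputSchema.add(StructField("greeting", StringType, nullable = false))

      test("withGreeting adds a constant greeting column") {
        val input = spark.createDataFrame(
          spark.sparkContext.parallelize(Seq(Row("Alice"), Row("Bob"))), inputSchema)

        val expected = spark.createDataFrame(
          spark.sparkContext.parallelize(Seq(Row("Alice", "hello world"), Row("Bob", "hello world"))),
          outputSchema)

        // assertDataFrameEquals comes from DataFrameSuiteBase and compares schema and rows
        assertDataFrameEquals(expected, input.transform(withGreeting))
      }
    }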
  23. 23. Testing SharedSparkContext 23 package com.holdenkarau.spark.testing import java.util.Date import org.apache.spark._ import org.scalatest.{BeforeAndAfterAll, Suite} /** * Shares a local `SparkContext` between all tests in a suite * and closes it at the end. You can share between suites by enabling * reuseContextIfPossible. */ trait SharedSparkContext extends BeforeAndAfterAll with SparkContextProvider { self: Suite => ... protected implicit def reuseContextIfPossible: Boolean = false ... }
  24. 24. Testing 24 package com.alpha.data.test trait SparkSuite extends DataFrameSuiteBase { self: Suite => override def reuseContextIfPossible: Boolean = true protected def createDF(data: Seq[Row], schema: StructType): DataFrame = { spark.createDataFrame(spark.sparkContext.parallelize(data), schema) } protected def jsonFixtureToDF(fileName: String, schema: Option[StructType] = None): DataFrame = { val fixtureContent = readFixtureContent(fileName) val fixtureJson = fixtureContentToJson(fixtureContent) jsonToDF(fixtureJson, schema) } protected def checkSchemas(inputSchema: StructType, expectedSchema: StructType): Unit = { assert(inputSchema.fields.sortBy(_.name).deep == expectedSchema.fields.sortBy(_.name).deep) } ... }
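A sketch of how the SparkSuite trait above could be used in practice; the onlyClassifiedAds transformation comes from slide 17 and is assumed to be importable, and the fixture rows are illustrative:

    package com.alpha.data.test

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.scalatest.FunSuite

    class OnlyClassifiedAdsSpec extends FunSuite with SparkSuite {

      private val schema = StructType(Seq(
        StructField("event_type", StringType),
        StructField("object_type", StringType)
      ))

      test("onlyClassifiedAds keeps only ClassifiedAd view events") {
        val input = createDF(Seq(
          Row("View", "ClassifiedAd"),
          Row("View", "Banner"),
          Row("Click", "ClassifiedAd")
        ), schema)

        val result = input.transform(onlyClassifiedAds)

        checkSchemas(result.schema, input.schema) // filtering must not change the schema
        assert(result.count() == 1)
      }
    }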
  25. 25. Testing 25 Spark Readers Transformation Spark Writers Task Context Testing each piece independently helps when testing them all together. Test
  26. 26. 26 We will learn: - How we reduced test execution time by skipping unaffected tests -> fewer coffees - How we simplified our Spark code by modularizing it - How we increased the test coverage of our Spark code by using the spark-testing-base library provided by Holden Karau
  27. 27. 27 Are tests taking too long to execute?
  28. 28. Junit4Git by Raquel Pau 28 github.com/rpau/junit4git “Junit Extensions for Test Impact Analysis.” “This is a JUnit extension that ignores those tests that are not related with your last changes in your Git repository.”
  29. 29. Junit4Git - Gradle conf 29 configurations { agent } dependencies { testCompile("org.walkmod:scalatest4git_2.11:${version}") agent "org.walkmod:junit4git-agent:${version}" } test.doFirst { jvmArgs "-javaagent:${configurations.agent.singleFile}" } @RunWith(classOf[ScalaGitRunner])
  30. 30. Junit4Git - Travis with Git notes 30 before_install: - echo -e "machine github.com\nlogin $GITHUB_TOKEN" >> ~/.netrc after_script: - git push origin refs/notes/tests:refs/notes/tests .travis.yml
  31. 31. Junit4Git - Runners 31 import org.scalatest.junit.JUnitRunner @RunWith(classOf[JUnitRunner]) abstract class UnitSpec extends FunSuite import org.scalatest.junit.ScalaGitRunner @RunWith(classOf[ScalaGitRunner]) abstract class IntegrationSpec extends FunSuite Tests to run always Tests to run on code changes
  32. 32. Junit4Git 32 [Sequence diagram: Runner, Listener, Agent, Test. On testStarted the report is started; classes loaded while the test method runs are instrumented by the agent (loadClass / instrumentClass) and recorded for that test (addTestClass); on testFinished the report is closed.]
  33. 33. Junit4Git- Notes 33 afranzi:~/Projects/data-core$ git notes show [ { "test": "com.alpha.data.core.ContextSuite", "method": "Context Classes with CommonBucketConf should contain a commonBucket Field", "classes": [ "com.alpha.data.core.Context", "com.alpha.data.core.ContextSuite$$anonfun$1$$anon$1", "com.alpha.data.core.Context$", "com.alpha.data.core.CommonBucketConf$class", "com.alpha.data.core.Context$$anonfun$getConfigParamWithLogging$1" ] } ... ] com.alpha.data.core.ContextSuite > Context Classes with CommonBucketConf should contain a commonBucket Field SKIPPED com.alpha.data.core.ContextSuite > Context Classes with BucketConf should contain a bucket Field SKIPPED
  34. 34. Junit4Git - Links of Interest 34 bit.ly/SlidesJunit4Git Real Impact Testing Analysis For JVM bit.ly/TestImpactAnalysis The Rise of Test Impact Analysis
  35. 35. Summary 35 Use Transforms as pills. Modularize your code. Don’t duplicate code between Spark jobs. Don’t be afraid of testing everything. Share everything you build as a library, so others can reuse your code.
  36. 36. 36
