
Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian and Anant Nag


As a Spark developer, do you want to quickly develop your Spark workflows? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write end-to-end tests for your workflows and add assertions on top of them?

In just a few years, the number of users writing Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever-increasing number of users and jobs, it becomes crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production. Currently, there is no way for users to test their Spark jobs end-to-end; the only option is to divide a Spark job into functions and unit test those functions.

We have tried to address these issues by creating a testing framework for Spark workflows. The testing framework enables users to run their jobs in an environment similar to the production environment, on data sampled from the original data. It consists of a test deployment system, a data generation pipeline to generate the sampled data, a data management system to help users manage and search the sampled data, and an assertion engine to validate the test output.

In this talk, we will discuss the motivation behind the testing framework before deep diving into its design. We will further discuss how the testing framework is helping Spark users at LinkedIn to be more productive.

  1. Anant Nag, LinkedIn Shankar M, LinkedIn Beyond unit tests: Testing for Spark/Hadoop workflows #EUde12
  2. #EUde12 A day in the life of a data engineer ● Produce the D.E.D report by 5 AM ● At 1 AM, an alert goes off saying the pipeline has failed ● Dev wakes up, curses his bad luck and starts the backfill job at high priority ● Finds the cluster busy, starts killing jobs to make way for the D.E.D job ● Debugs failures, finds today is daylight savings day and the data partitions have gone haywire ● Most days it works on retry ● Some days, we are not so lucky NO D.E.D => We are D.E.A.D
  3. #EUde12 Scale @ LinkedIn • 10s of clusters running different versions of software • 1000s of machines in each cluster • 1000s of users • 100s of 1000s of Azkaban workflows running per month • Powers key business impacting features – People you may know – Who viewed my profile
  4. #EUde12 Nightmares at data street ● Cluster gets upgraded ● Data partition changes ● Code needs to be rewritten in a new technology ● Different version of a dependent jar is available ● ...
  5. #EUde12 Do you know your dependencies? ● Direct dependencies ● Indirect dependencies ● Hidden dependencies ● Semantic dependencies “Hey, I am changing column X in data P to format B. Do you foresee any issues?”
  6. #EUde12 Paranoia is justified ● No confidence to make changes ● Lack of agility ● Loss of innovation
  7. #EUde12 Architecture ● Workflow definition ● Test definitions ● Test execution environment: ○ Local ○ Production ● Test data
  8. #EUde12 Workflow definition hadoop { workflow('workflow1') { sparkJob('job1') { uses 'com.linkedin.example.SparkJob' executes 'exampleSpark.jar' jars 'jar1.jar,jar2.jar' executorMemory '2G' numExecutors 400 } sparkJob('job2') { uses 'com.linkedin.example.SparkJob2' executes 'exampleSpark.jar' depends 'job1' } targets 'job2' } }
  9. #EUde12 Test definition hadoop { workflow('countByCountryFlow') { sparkJob('countByCountry') { uses 'com.linkedin.example.SparkJob' executes 'exampleSpark.jar' reads files: [ 'input_data': "/data/input" ] writes files: [ 'output_path': "/jobs/output" ] } targets 'countByCountry' } } hadoop { workflowTestSuite("test1") { addWorkflow('countByCountryFlow') { } } workflowTestSuite("test2") { ... } }
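The job class that the DSL's uses line points at is an ordinary Spark application; only the DSL wiring is shown in the talk. As a minimal sketch, assuming Spark 2.x plus the Databricks spark-avro package, the job behind countByCountry could look like this (the class body, the Avro input format, and the argument passing are assumptions, not from the talk):

    // Hypothetical body for the job registered as 'countByCountry' above.
    // Reads the dataset behind 'input_data', counts rows per country, and
    // writes the result to the path behind 'output_path'.
    import com.databricks.spark.avro._
    import org.apache.spark.sql.SparkSession

    object SparkJob {
      def main(args: Array[String]): Unit = {
        val Array(inputPath, outputPath) = args.take(2)
        val spark = SparkSession.builder().appName("countByCountry").getOrCreate()
        spark.read.avro(inputPath)
          .groupBy("country")
          .count()
          .write.avro(outputPath)
        spark.stop()
      }
    }

Because the paths arrive as parameters rather than being hard-coded, a test suite can later re-point input_data and output_path at test locations, which is exactly what the parameter override on the next slide does.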
  10. #EUde12 Overriding parameters hadoop { workflow('countByCountryFlow') { sparkJob('countByCountry') { uses 'com.linkedin.example.SparkJob' executes 'exampleSpark.jar' reads files: [ 'input_data': "/data/input" ] writes files: [ 'output_path': "/jobs/output" ] } targets 'countByCountry' } } hadoop { workflowTestSuite("test1") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { reads files: [ 'input_data': '/path/to/test/data' ] writes files: [ 'output_path': '/path/to/test/output' ] } } } }
  11. #EUde12 Configuration override ● 10s of clusters ● Multiple versions of Spark ● Some clusters update now, some later ● Code should run on all the versions
  12. #EUde12 Configuration override ● Write multiple tests ● One test for each version of Spark ● Override the Spark version in the tests workflowTestSuite("testWithSpark1_6_3") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { set properties: [ 'spark.version': '1.6.3' ] } } } workflowTestSuite("testWithSpark2_1_0") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { set properties: [ 'spark.version': '2.1.0' ] } } }
  13. #EUde12 assert(x == y)
  14. #EUde12 True story ● Complex pipeline with 10s of jobs ● Most of them are Spark jobs
  15. #EUde12 ● DataFrames API ● Rewrite all Spark jobs to use DataFrames ● Is my new code ready for production?? ● Write tests ○ Assertions on output ● All tests succeed after changes (^_^)
  16. #EUde12 Data validation and assertion ● Types of checks ○ Record level checks ○ Aggregate level ■ Data transformation ■ Data aggregation ■ Data distribution ● Assert against expectation
  17. #EUde12 Record level validation us 148083 in 46074 cn 34332 br 30836 gb 24387 fr 14983 ... hadoop { workflowTestSuite('test1') { addWorkflow('countFlow', 'testCountFlow') {} assertionWorkflow('assertNonNegativeCount') { sparkJob('assertNonNegativeCount') { } targets 'assertNonNegativeCount' } } } val count = spark.read.avro(input) require(count.map(r => r.getAs[Long]("count")).filter(_ < 0).count() == 0)
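The inline snippet above compresses the assertion job onto one slide. As a self-contained sketch (assuming Spark 2.x plus the Databricks spark-avro package, which supplies the read.avro syntax the slide uses), the assertion job might look like:

    // Hypothetical standalone version of the 'assertNonNegativeCount' job:
    // fail the test if any per-country count in the output is negative.
    import com.databricks.spark.avro._
    import org.apache.spark.sql.SparkSession

    object AssertNonNegativeCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("assertNonNegativeCount").getOrCreate()
        val counts = spark.read.avro(args(0)) // the test output path
        val negatives = counts.filter(counts("count") < 0).count()
        require(negatives == 0, s"$negatives rows have a negative count")
        spark.stop()
      }
    }

A failed require aborts the Spark job, which in turn fails the assertion workflow and marks the test as failed.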
  18. #EUde12 Aggregated validation ● Binary spam classifier C1 ● C1 classifies 30% of test input as spam ● Wrote a new classifier C2 ● C2 can deviate at most 10% ● Write aggregated validations
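The talk does not show code for this check; here is a minimal sketch of what it could look like, assuming each classifier's scored output is Avro with a boolean is_spam column and reading "deviate at most 10%" as a relative bound (both are assumptions):

    // Hypothetical aggregate-level validation: compare the spam rate of the
    // new classifier C2 against the old classifier C1 and fail if they differ
    // by more than 10% relative to C1's rate.
    import com.databricks.spark.avro._
    import org.apache.spark.sql.SparkSession

    object AssertSpamRateStable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("assertSpamRateStable").getOrCreate()

        def spamRate(path: String): Double = {
          val scored = spark.read.avro(path)
          scored.filter(scored("is_spam") === true).count().toDouble / scored.count()
        }

        val c1Rate = spamRate(args(0)) // e.g. ~0.30 for classifier C1
        val c2Rate = spamRate(args(1)) // output of the rewritten classifier C2
        require(math.abs(c2Rate - c1Rate) <= 0.10 * c1Rate,
          s"C2 spam rate $c2Rate deviates more than 10% from C1 rate $c1Rate")
        spark.stop()
      }
    }

Unlike the record-level check, this assertion says nothing about individual rows; it only pins down an aggregate property of the whole output.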
  19. #EUde12 Execution
  20. #EUde12 Test execution $> gradle azkabanTest -Ptestname=test1 ● Test results on the terminal ● Reports for the passed and failed tests ● Summary of the completed tests and job statistics per test, e.g.: Tests [1/1] => Flow TEST-countByCountry completed with status SUCCEEDED | Flow Name: TEST-countByCountry | Latest Exec ID: 2997645 | Status: SUCCEEDED | Running: 0 | Succeeded: 2 | Failed: 0 | Ready: 0 | Cancelled: 0 | Disabled: 0 | Total: 2
  21. #EUde12 Test automation ● Auto deployment process for Azkaban artifacts ● Tests run as part of the deployment ● Tests fail => artifact deployment fails ● No untested code can go to production
  22. #EUde12 Test Data
  23. #EUde12 Test data ● Real data ○ Very large ○ Tests run very slowly ● Randomly generated data ○ Not equivalent to real data ○ Real issues can never be caught ● Manually created data ○ Covers all the cases ○ Too much effort ● Samples
  24. #EUde12 Requirements of sample data ● Representative of real data ● Smaller in size ● Automated generation ● Sharable ● Discoverable
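One concrete way to meet the "representative" and "smaller" requirements is stratified sampling with Spark's built-in sampleBy. The talk does not reveal the actual data generation pipeline, so the following is only an illustration (the paths, the country key, and the 1% rate are assumptions):

    // Hypothetical sampling job: keep ~1% of every country so the sample
    // preserves the per-country distribution of the full dataset.
    import com.databricks.spark.avro._
    import org.apache.spark.sql.SparkSession

    object SampleInputData {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sampleInputData").getOrCreate()
        val full = spark.read.avro("/data/input")

        // One sampling fraction per distinct country value.
        val fractions = full.select("country").distinct()
          .collect().map(r => r.getString(0) -> 0.01).toMap

        full.stat.sampleBy("country", fractions, seed = 42L)
          .write.avro("/path/to/test/data") // the path used in the test definition earlier
        spark.stop()
      }
    }

A fixed seed makes the sample reproducible, which helps with the "sharable" and "discoverable" requirements: the same sample can be regenerated and referenced by many tests.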
  25. The road ahead...
  26. #EUde12 Flexible execution environment ● Sandboxed ● Replicate production settings ● Save and restore ● Ability to run in a single box
  27. #EUde12 Data validation framework ● Validation logic in schema ● Automated validation on read and write ● Serves as a contract between producers and consumers
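This framework is still road-ahead work, but as a thought experiment, "validation logic in schema" could mean attaching a predicate to a field as metadata and enforcing it on every read. The validate metadata key and the wrapper below are illustrative assumptions, not the actual design:

    // Hypothetical schema-driven validation: each field may carry a SQL
    // predicate in its metadata that is checked automatically on read,
    // acting as a contract between the data's producers and consumers.
    import com.databricks.spark.avro._
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ValidatedRead {
      def readValidated(spark: SparkSession, path: String): DataFrame = {
        val df = spark.read.avro(path)
        df.schema.fields.foreach { field =>
          if (field.metadata.contains("validate")) {
            val rule = field.metadata.getString("validate") // e.g. "count >= 0"
            val violations = df.filter(s"NOT ($rule)").count()
            require(violations == 0, s"$violations rows violate '$rule' on ${field.name}")
          }
        }
        df
      }
    }

A symmetric check on write would stop bad data at the producer, so consumers could treat a successfully written dataset as already validated.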
  28. #EUde12 Azkaban DSL: https://github.com/linkedin/linkedin-gradle-plugin-for-apache-hadoop Azkaban: https://github.com/azkaban/azkaban
