SlideShare a Scribd company logo
Anant Nag, LinkedIn
Shankar M, LinkedIn
Beyond unit tests: Testing for
Spark/Hadoop workflows
#EUde12
#EUde12
A day in the life of data engineer
● Produce D.E.D Report by 5 AM.
● At 1 AM, an alert goes off saying the pipeline has
failed
● Dev wakes up, curses his bad luck and starts backfill
job at high priority
● Finds cluster busy, starts killing jobs to make way for
D.E.D job
● Debugs failures, finds today is daylight savings day
and data partitions has gone haywire.
● Most days it works on retry
● Some days, we are not so lucky
NO D.E.D => We are D.E.A.D
#EUde12
Scale @ LinkedIn
• 10s of clusters running different versions of software
• 1000s of machines in each cluster
• 1000s of users
• 100s of 1000s of Azkaban workflows running per month
• Powers key business impacting features
– People you may know
– Who viewed my profile
#EUde12
● Cluster gets upgraded
● Data partition changes
● Code needs to be rewritten in a
new technology
● Different version of a dependent
jar is available
● ...
Nightmares at data street
#EUde12
Do you know your dependencies?
● Direct dependencies
● Indirect dependencies
● Hidden dependencies
● Semantic dependencies
“Hey, I am changing column X in data P to format B. Do you foresee any issues?”
Do you know your dependencies?
#EUde12
Paranoia is justified
No confidence to make changes
Lack of agility
Loss of innovation
Paranoia is justified
#EUde12
Architecture
● Workflow definition
● Test definitions
● Test execution environment:
○ Local
○ Production
● Test data
Architecture
#EUde12
hadoop {
workflow('workflow1') {
sparkJob('job1') {
uses 'com.linkedin.example.SparkJob’
executes ‘exampleSpark.jar’
jars ‘jar1.jar,jar2.jar’
executorMemory ‘2G’
numExecutors 400
}
sparkJob('job2') {
uses 'com.linkedin.example.SparkJob2’
executes ‘exampleSpark.jar’
depends 'job1'
}
targets 'job2'
}
}
Workflow definition
#EUde12
Test definition
hadoop {
workflow('countByCountryFlow') {
sparkJob('countByCountry') {
uses 'com.linkedin.example.SparkJob’
executes ‘exampleSpark.jar’
reads files: [
'input_data': "/data/input"
]
writes files: [
'output_path': "/jobs/output"
]
}
targets 'countByCountry'
}
}
hadoop {
workflowTestSuite("test1") {
addWorkflow('countByCountryFlow') {
}
}
workflowTestSuite("test2") {
...
}
}
Test definition
#EUde12
hadoop {
workflow('countByCountryFlow') {
sparkJob('countByCountry') {
uses 'com.linkedin.example.SparkJob’
executes ‘exampleSpark.jar’
reads files: [
'input_data': "/data/input"
]
writes files: [
'output_path': "/jobs/output"
]
}
targets 'countByCountry'
}
}
hadoop {
workflowTestSuite("test1") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
reads files: [
'input_data': '/path/to/test/data'
]
writes files: [
'output_path': '/path/to/test/output'
]
}
}
}
}
Overriding parameters
#EUde12
● 10s of clusters
● Multiple versions of Spark
● Some clusters update now, some later
● Code should run on all the versions
Configuration override
#EUde12
Configuration override
● Write multiple tests
● One test for each version of Spark
● Override Spark version in the tests
workflowTestSuite("testWithSpark1_6_3") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
‘spark.version’: ‘1.6.3’
]
}
}
}
workflowTestSuite("testWithSpark2_1_0") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
‘spark.version’: ‘2.1.0’
]
}
}
}
Configuration override
#EUde12
assert(x == y)
#EUde12
● Complex pipeline with 10s of Jobs
● Most of them are Spark Jobs
True story
#EUde12
● DataFrames API
● Rewrite all spark jobs to use DataFrames
● Is my new code ready for production??
● Write tests
○ Assertions on output
● All tests succeed after changes (^_^)
#EUde12
Data Validation and Assertion
● Types of checks
○ Record level checks
○ Aggregate level
■ Data transformation
■ Data aggregation
■ Data distribution
● Assert against Expectation
Data validation and assertion
#EUde12
Record level Validation
us 148083
in 46074
cn 34332
br 30836
gb 24387
fr 14983
...
hadoop {
workflowTestSuite(‘test1’) {
addWorkflow(‘countFlow’,’testCountFlow’){}
assertionWorkflow('assertNonNegativeCount')
{
sparkJob("assertNonNegativeCount") {
}
targets 'assertNonNegativeCount'
}
}
val count = spark.read.avro(input)
require(count.
map(r => r.getAs[Long](“count”)).
where(_ < 0 )).count() == 0)
Record level validation
#EUde12
Aggregated validation
● Binary spam classifier C1
● C1 classifies 30% of test input as spam
● Wrote a new classifier C2
● C2 can deviate at most 10%
● Write aggregated validations
Aggregated validation
#EUde12
Execution
#EUde12
Test execution
$> gradle azkabanTest -Ptestname=test1
● Test results on the terminal
● Reports for the passed and failed tests
Tests [1/1]:=> Flow TEST-countByCountry completed with status SUCCEEDED
Summary of the completed tests
Job Statistics for individual test
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Flow Name | Latest Exec ID | Status | Running | Succeeded | Failed | Ready | Cancelled | Disabled | Total
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
TEST-countByCountry | 2997645 | SUCCEEDED | 0 | 2 | 0 | 0 | 0 | 0 | 2
Test execution
#EUde12
Test automation
● Auto deployment process for Azkaban artifacts
● Tests run as part of the deployment
● Tests fail => Artifact deployment fails
● No un-tested code can go to production
Test automation
#EUde12
Test Data
#EUde12
Test data
● Real data
○ Very large data
○ Tests run very slow
● Randomly generated data
○ Not equivalent to real data
○ Real issues can never be catched
● Manually created data
○ Covers all the cases
○ Too much effort
● Samples
Test data
#EUde12
Requirements of test data
● Representative of real data
● Smaller in size
● Automated generation
● Sharable
● Discoverable
Requirements of sample data
The road ahead...
Road ahead
#EUde12
Flexible execution environment
● Sandboxed
● Replicate production settings
● Save and Restore
● Ability to run in a single box
Flexible execution environment
#EUde12
Data validation framework
● Validation logic in schema
● Automated validation on read and write
● Serves as contract between producers and consumers
Data validation Framework
#EUde12
Azkaban DSL: https://github.com/linkedin/linkedin-gradle-plugin-for-apache-hadoop
Azkaban: https://github.com/azkaban/azkaban

More Related Content

What's hot

Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
Databricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
Databricks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
Dat Tran
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
Russell Jurney
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
Databricks
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
Spark Summit
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
 
Cobrix – a COBOL Data Source for Spark
Cobrix – a COBOL Data Source for SparkCobrix – a COBOL Data Source for Spark
Cobrix – a COBOL Data Source for Spark
DataWorks Summit
 

What's hot (20)

Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Cobrix – a COBOL Data Source for Spark
Cobrix – a COBOL Data Source for SparkCobrix – a COBOL Data Source for Spark
Cobrix – a COBOL Data Source for Spark
 

Similar to Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Anant Nag

Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
DataWorks Summit
 
Gatling Performance Workshop
Gatling Performance WorkshopGatling Performance Workshop
Gatling Performance Workshop
Sai Krishna
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Databricks
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 
Sge
SgeSge
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
Engine Yard
 
Android training day 4
Android training day 4Android training day 4
Android training day 4
Vivek Bhusal
 
Testing Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with SherlockTesting Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with Sherlock
ScyllaDB
 
Ebs dba con4696_pdf_4696_0001
Ebs dba con4696_pdf_4696_0001Ebs dba con4696_pdf_4696_0001
Ebs dba con4696_pdf_4696_0001jucaab
 
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzioneJava 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
ThinkOpen
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
Roger Rafanell Mas
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
prevota
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
Databricks
 
Pandora FMS: PostgreSQL Plugin
Pandora FMS: PostgreSQL PluginPandora FMS: PostgreSQL Plugin
Pandora FMS: PostgreSQL Plugin
Pandora FMS
 
Using Grails to Power your Electric Car
Using Grails to Power your Electric CarUsing Grails to Power your Electric Car
Using Grails to Power your Electric CarGR8Conf
 
How to test code with mruby
How to test code with mrubyHow to test code with mruby
How to test code with mruby
Hiroshi SHIBATA
 
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the BusinessGive Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
Formant
 

Similar to Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Anant Nag (20)

Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
 
Gatling Performance Workshop
Gatling Performance WorkshopGatling Performance Workshop
Gatling Performance Workshop
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Sge
SgeSge
Sge
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
 
Android training day 4
Android training day 4Android training day 4
Android training day 4
 
Testing Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with SherlockTesting Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with Sherlock
 
Ebs dba con4696_pdf_4696_0001
Ebs dba con4696_pdf_4696_0001Ebs dba con4696_pdf_4696_0001
Ebs dba con4696_pdf_4696_0001
 
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzioneJava 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Gradle como alternativa a maven
Gradle como alternativa a mavenGradle como alternativa a maven
Gradle como alternativa a maven
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Pandora FMS: PostgreSQL Plugin
Pandora FMS: PostgreSQL PluginPandora FMS: PostgreSQL Plugin
Pandora FMS: PostgreSQL Plugin
 
Using Grails to Power your Electric Car
Using Grails to Power your Electric CarUsing Grails to Power your Electric Car
Using Grails to Power your Electric Car
 
How to test code with mruby
How to test code with mrubyHow to test code with mruby
How to test code with mruby
 
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the BusinessGive Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 

Recently uploaded (20)

一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 

Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Anant Nag

  • 1. Anant Nag, LinkedIn Shankar M, LinkedIn Beyond unit tests: Testing for Spark/Hadoop workflows #EUde12
  • 2. #EUde12 A day in the life of data engineer ● Produce D.E.D Report by 5 AM. ● At 1 AM, an alert goes off saying the pipeline has failed ● Dev wakes up, curses his bad luck and starts backfill job at high priority ● Finds cluster busy, starts killing jobs to make way for D.E.D job ● Debugs failures, finds today is daylight savings day and data partitions has gone haywire. ● Most days it works on retry ● Some days, we are not so lucky NO D.E.D => We are D.E.A.D
  • 3. #EUde12 Scale @ LinkedIn • 10s of clusters running different versions of software • 1000s of machines in each cluster • 1000s of users • 100s of 1000s of Azkaban workflows running per month • Powers key business impacting features – People you may know – Who viewed my profile
  • 4. #EUde12 ● Cluster gets upgraded ● Data partition changes ● Code needs to be rewritten in a new technology ● Different version of a dependent jar is available ● ... Nightmares at data street
  • 5. #EUde12 Do you know your dependencies? ● Direct dependencies ● Indirect dependencies ● Hidden dependencies ● Semantic dependencies “Hey, I am changing column X in data P to format B. Do you foresee any issues?” Do you know your dependencies?
  • 6. #EUde12 Paranoia is justified No confidence to make changes Lack of agility Loss of innovation Paranoia is justified
  • 7. #EUde12 Architecture ● Workflow definition ● Test definitions ● Test execution environment: ○ Local ○ Production ● Test data Architecture
  • 8. #EUde12 hadoop { workflow('workflow1') { sparkJob('job1') { uses 'com.linkedin.example.SparkJob’ executes ‘exampleSpark.jar’ jars ‘jar1.jar,jar2.jar’ executorMemory ‘2G’ numExecutors 400 } sparkJob('job2') { uses 'com.linkedin.example.SparkJob2’ executes ‘exampleSpark.jar’ depends 'job1' } targets 'job2' } } Workflow definition
  • 9. #EUde12 Test definition hadoop { workflow('countByCountryFlow') { sparkJob('countByCountry') { uses 'com.linkedin.example.SparkJob’ executes ‘exampleSpark.jar’ reads files: [ 'input_data': "/data/input" ] writes files: [ 'output_path': "/jobs/output" ] } targets 'countByCountry' } } hadoop { workflowTestSuite("test1") { addWorkflow('countByCountryFlow') { } } workflowTestSuite("test2") { ... } } Test definition
  • 10. #EUde12 hadoop { workflow('countByCountryFlow') { sparkJob('countByCountry') { uses 'com.linkedin.example.SparkJob’ executes ‘exampleSpark.jar’ reads files: [ 'input_data': "/data/input" ] writes files: [ 'output_path': "/jobs/output" ] } targets 'countByCountry' } } hadoop { workflowTestSuite("test1") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { reads files: [ 'input_data': '/path/to/test/data' ] writes files: [ 'output_path': '/path/to/test/output' ] } } } } Overriding parameters
  • 11. #EUde12 ● 10s of clusters ● Multiple versions of Spark ● Some clusters update now, some later ● Code should run on all the versions Configuration override
  • 12. #EUde12 Configuration override ● Write multiple tests ● One test for each version of Spark ● Override Spark version in the tests workflowTestSuite("testWithSpark1_6_3") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { set properties: [ ‘spark.version’: ‘1.6.3’ ] } } } workflowTestSuite("testWithSpark2_1_0") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { set properties: [ ‘spark.version’: ‘2.1.0’ ] } } } Configuration override
  • 14. #EUde12 ● Complex pipeline with 10s of Jobs ● Most of them are Spark Jobs True story
  • 15. #EUde12 ● DataFrames API ● Rewrite all spark jobs to use DataFrames ● Is my new code ready for production?? ● Write tests ○ Assertions on output ● All tests succeed after changes (^_^)
  • 16. #EUde12 Data Validation and Assertion ● Types of checks ○ Record level checks ○ Aggregate level ■ Data transformation ■ Data aggregation ■ Data distribution ● Assert against Expectation Data validation and assertion
  • 17. #EUde12 Record level Validation us 148083 in 46074 cn 34332 br 30836 gb 24387 fr 14983 ... hadoop { workflowTestSuite(‘test1’) { addWorkflow(‘countFlow’,’testCountFlow’){} assertionWorkflow('assertNonNegativeCount') { sparkJob("assertNonNegativeCount") { } targets 'assertNonNegativeCount' } } val count = spark.read.avro(input) require(count. map(r => r.getAs[Long](“count”)). where(_ < 0 )).count() == 0) Record level validation
  • 18. #EUde12 Aggregated validation ● Binary spam classifier C1 ● C1 classifies 30% of test input as spam ● Wrote a new classifier C2 ● C2 can deviate at most 10% ● Write aggregated validations Aggregated validation
  • 20. #EUde12 Test execution $> gradle azkabanTest -Ptestname=test1 ● Test results on the terminal ● Reports for the passed and failed tests Tests [1/1]:=> Flow TEST-countByCountry completed with status SUCCEEDED Summary of the completed tests Job Statistics for individual test ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Flow Name | Latest Exec ID | Status | Running | Succeeded | Failed | Ready | Cancelled | Disabled | Total ------------------------------------------------------------------------------------------------------------------------------------------------------------------- TEST-countByCountry | 2997645 | SUCCEEDED | 0 | 2 | 0 | 0 | 0 | 0 | 2 Test execution
  • 21. #EUde12 Test automation ● Auto deployment process for Azkaban artifacts ● Tests run as part of the deployment ● Tests fail => Artifact deployment fails ● No un-tested code can go to production Test automation
  • 23. #EUde12 Test data ● Real data ○ Very large data ○ Tests run very slow ● Randomly generated data ○ Not equivalent to real data ○ Real issues can never be catched ● Manually created data ○ Covers all the cases ○ Too much effort ● Samples Test data
  • 24. #EUde12 Requirements of test data ● Representative of real data ● Smaller in size ● Automated generation ● Sharable ● Discoverable Requirements of sample data
  • 26. #EUde12 Flexible execution environment ● Sandboxed ● Replicate production settings ● Save and Restore ● Ability to run in a single box Flexible execution environment
  • 27. #EUde12 Data validation framework ● Validation logic in schema ● Automated validation on read and write ● Serves as contract between producers and consumers Data validation Framework