Managing millions of tests using Databricks
Yin Huai
Databricks
Who am I?
• Yin Huai
  Staff Software Engineer, Databricks
• Databricks Runtime group
  Focusing on designing and building the Databricks Runtime container environment and its associated testing and release infrastructure
• Apache Spark PMC member
Global-scale & multi-cloud data platform
Want to learn more about our experience building and scaling Databricks’ unified analytics platform? Check out “Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks” by Jeff Pang.
Deep technical stack
[Architecture diagram of the data platform serving many customer networks. Bottom to top: cloud VMs, network, storage, and databases; Kubernetes; shared services (HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing, ...); Envoy and GraphQL; cluster manager (CM) masters and shards, API servers, and workers.]
Wide surface area
[Diagram: the control plane serves many customer networks, each with a data lake (CSV, JSON, TXT…) and streaming sources such as Kinesis. Control-plane capabilities include collaborative notebooks, AI, streaming, analytics, reporting and business insights, workflow scheduling, cluster management, and admin & security.]
Large scale of customer workloads
Millions of Databricks Runtime clusters managed per day
Testing, testing, testing
• In-house CI system (replacing Jenkins) to execute tests at scale
• GitHub webhook receiver/consumer to dispatch CI jobs at scale
~2.8 test-years/day
~54 million tests/day
~630 tests/sec
Handle test results at scale?
• Tests fail every day
• If a test fails on average…
  • 1 out of 1,000,000 runs: 54 failures/day
  • 1 out of 100,000 runs: 540 failures/day
  • …
• How to keep up?
~2.8 test-years/day
~54 million tests/day
~630 tests/sec
Build a system that automatically triages test failures to the right owners in a developer-friendly form
Guiding principles
• Automated: test failures are collected and reported without any manual intervention
• Connecting the problem with the right owner: the system can decide who should receive a report
• Developer-friendly failure reporting: build the workflow around our Jira-centric development workflow, curate reports with the appropriate level of detail, and empower users to correct failure attribution
In the rest of this talk…
The data problem to solve
How to approach the problem and build a solution
System overview
How to get everything implemented
The data problem to solve
What is the actual problem?
[Diagram: on one side, the in-house CI system, Jenkins, and code repositories with Bazel as the build tool; on the other, Jira; and "???" in between.]
Building data pipelines
[Diagram: "Collect test results" pipelines from the in-house CI system and from Jenkins, a "Collect test-to-owner mapping" pipeline from the code repositories with Bazel as the build tool, and a "Report test failures" pipeline into Jira.]
System overview
Use the right tools for solving the problem
• Hosting data pipelines
  • Taking advantage of the unified analytics platform
• Loading CI systems’ results and Bazel build metadata
  • Apache Spark’s data source APIs
• Storing datasets
  • Delta makes continuous data ingestion simple
Establishing test results tables
[Pipeline diagram, focusing on the "Collect test results" steps from the in-house CI system and Jenkins.]
Establishing test results tables
• In-house CI system: Spark JDBC connector
• Jenkins: Spark Jenkins connector

val df = spark
  .read
  .format("com.databricks.sql.jenkins.JenkinsSource")
  .option("host", ...)
  .option("username", ...)
  .option("passwordOrToken", ...)
  .option("table", "tests")        // one of "jobs", "builds", "tests"
  .option("builds.fetchLimit", 25) // optional
  .load()

Supports jobs, builds, and tests views:
• Jobs view: query available jobs
• Builds view: query build statuses
• Tests view: query detailed test results of selected builds (error messages, stack traces, …) exposed by the JUnit plugin
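Once loaded, the view is an ordinary Spark DataFrame. As a small usage sketch (the column names below are assumptions, not the connector's documented schema), failed tests can be pulled from the tests view with plain DataFrame operations:

import org.apache.spark.sql.functions.col

// Hypothetical columns: keep only failed tests, with the fields that the
// downstream triage pipeline needs.
val failedTests = df
  .filter(col("status") === "FAILED")
  .select("job", "build", "suite", "test", "errorMessage", "stacktrace")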
Establishing test results tables
• Delta makes building the continuous data ingestion pipeline easy
  • Only ingest new results from CI systems, using MERGE INTO
  • Ingest results from different Jenkins jobs in parallel into the same destination table
  • Roll back to a recent version in case there is a bad write, with Delta Time Travel
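A minimal sketch of the idempotent ingestion step, assuming a hypothetical destination table ci.test_results keyed by (build_id, test_name); Delta's Scala MERGE API inserts only rows that have not been ingested yet, so loads can safely be re-run and can run in parallel against the same table:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Insert only (build_id, test_name) keys that are not already in the table,
// so re-running a load after a partial failure creates no duplicates.
def ingest(newResults: DataFrame): Unit = {
  DeltaTable.forName(spark, "ci.test_results").as("t")
    .merge(newResults.as("s"),
      "t.build_id = s.build_id AND t.test_name = s.test_name")
    .whenNotMatched()
    .insertAll()
    .execute()
}

// Delta Time Travel: read an earlier version to recover from a bad write.
val beforeBadWrite = spark.read.format("delta")
  .option("versionAsOf", 41)      // hypothetical version number
  .load("/delta/ci/test_results") // hypothetical table path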
Establishing test owners table
[Pipeline diagram, focusing on the "Collect test-to-owner mapping" step from the code repositories with Bazel as the build tool.]
Establishing test owners table
• Bazel can output structured build metadata (in XML) for every build target: bazel query --output=xml
• Bazel build targets can have user-specified metadata, e.g., owners
<?xml version="1.1" encoding="UTF-8" standalone="no"?>
<query version="2">
<rule class="generic_scala_test" location="..." name="//foo/bar:MyTest">
<string name="name" value="MyTest"/>
<list name="visibility">
<label value="//visibility:public"/>
</list>
<list name="tags">
<string value="MyTag"/>
</list>
<string name="generator_name" value="MyTest"/>
...
<string name="size" value="medium"/>
<string name="scala_version" value="2.12"/>
<list name="suites">
<string value="com.databricks.MyTest"/>
</list>
<list name="owners">
<string value="spark-env"/>
</list>
<list name="sys_props">
<string value="log4j.debug=true"/>
<string value="log4j.configuration=log4j.properties"/>
</list>
...
</rule>
</query>
Establishing test owners table
• The test owners table includes:
  • The test suite name (the suite name appearing in JUnit test reports)
  • The corresponding Jira component of the owner
• More fields provided by Bazel can be easily added
Pipeline: check out repositories → query Bazel → parse XML records → insert/update the Delta table
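A sketch of the "parse XML records" step, under the assumption that the suites and owners lists look like the sample above (the helper names here are illustrative):

import scala.xml.{Node, XML}

case class TestOwner(suite: String, owner: String)

// Extract the string values of a named <list> inside a <rule> element.
def values(rule: Node, listName: String): Seq[String] =
  (rule \ "list")
    .filter(l => (l \ "@name").text == listName)
    .flatMap(l => (l \ "string").map(s => (s \ "@value").text))

// One (suite, owner) row per combination found in `bazel query --output=xml` output.
def parseOwners(xmlPath: String): Seq[TestOwner] =
  for {
    rule  <- (XML.loadFile(xmlPath) \ "rule").toSeq
    suite <- values(rule, "suites")
    owner <- values(rule, "owners")
  } yield TestOwner(suite, owner)

// spark.createDataFrame(parseOwners("query.xml")) can then be upserted into
// the Delta table, e.g. with a MERGE like the ingestion sketch earlier.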
Reporting test failures to Jira
[Pipeline diagram, focusing on the "Report test failures" step into Jira.]
Test reporting pipeline
[Diagram: failure detector → failure analyzer → failure reporter → Jira, drawing on the test results tables, the test owners table, and the test failure report logs; already-reported failures are ignored. A sketch of the detector follows.]
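One way to read the detector stage, sketched with hypothetical table and column names: a new failure is a failed result that does not yet appear in the report logs, which a left anti-join expresses directly:

import org.apache.spark.sql.functions.col

// Failed results that have no matching entry in the report log yet.
val newFailures = spark.table("ci.test_results")
  .filter(col("status") === "FAILED")
  .join(spark.table("ci.test_failure_reports"),
    Seq("build_id", "test_name"), "left_anti")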
Connecting the problem with the right owner
• The test owner is not necessarily the owner of the failure
• Types of test failures
  • Type 1. The testing environment has a problem: the owner of the problem should own the failure.
    • E.g., cloud provider errors and staging service incidents
  • Type 2. Failed because another test failed: noise; no owner needs to be assigned to this failure.
    • This type represents test isolation problems, which should be eliminated.
  • Type 3. Other causes: the owner of the test should own the failure.
[Diagram: the failure analyzer splits failed tests into type 1, type 2, and type 3 failures before they reach the failure reporter.]
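A minimal sketch of how the analyzer's split could be expressed; the match patterns below are purely illustrative, and (as a later slide notes) the real categorization rules are maintained and updated by developers:

// Hypothetical, heavily simplified categorization rules.
sealed trait FailureType
case object EnvironmentProblem extends FailureType // type 1: owned by the problem's owner
case object CascadingFailure   extends FailureType // type 2: noise, no owner assigned
case object TestProblem        extends FailureType // type 3: owned by the test owner

def categorize(errorMessage: String, earlierFailureInSameRun: Boolean): FailureType =
  if (errorMessage.contains("QuotaExceeded")) EnvironmentProblem // illustrative pattern
  else if (earlierFailureInSameRun) CascadingFailure
  else TestProblem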
Developer-friendly failure reporting
• Two critical use cases to support
  • Understand the unique problems associated with given teams over a given time window
  • Understand exactly how a test is failing in a given testing environment
• Two-layer reporting structure
  • Parent Jira ticket: represents a unique problem, e.g., a test suite or a cloud provider error
  • Subtask: represents the individual failures happening in a specific testing environment, e.g., all failures of a given test suite in the AWS staging workspace associated with Databricks Runtime 8.1
  • A new failure finds the right open parent ticket and subtask, and then adds a new comment
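A sketch of the two-layer reporting flow; the JiraClient interface below is hypothetical (a thin wrapper one might write over the Jira REST API), not an existing library:

trait JiraClient { // hypothetical wrapper over the Jira REST API
  def findOpenTicket(summary: String): Option[String]
  def createTicket(summary: String): String
  def findOpenSubtask(parentKey: String, summary: String): Option[String]
  def createSubtask(parentKey: String, summary: String): String
  def addComment(issueKey: String, body: String): Unit
}

case class Failure(problem: String, environment: String, details: String)

// Find (or create) the parent ticket for the unique problem and the subtask for
// the testing environment, then record the new occurrence as a comment.
def report(jira: JiraClient, f: Failure): Unit = {
  val parent  = jira.findOpenTicket(f.problem).getOrElse(jira.createTicket(f.problem))
  val subtask = jira.findOpenSubtask(parent, f.environment)
    .getOrElse(jira.createSubtask(parent, f.environment))
  jira.addComment(subtask, f.details)
}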
Developer-friendly failure reporting

Example ticket 1 (type 1):
  Parent: Cloud Provider Error | VM Quota Exceeded
    Subtask: com.databricks.Suite1 | DBR 8.1 | AWS Staging
    Subtask: com.databricks.Suite2 | DBR 8.2 | Azure Staging
    Subtask: com.databricks.Suite3 | DBR 8.3 | GCP Staging

Example ticket 2 (type 3):
  Parent: com.databricks.FooBarSuite
    Subtask: com.databricks.FooBarSuite | DBR 8.1 | AWS Staging
    Subtask: com.databricks.FooBarSuite | DBR 8.2 | Azure Staging
    Subtask: com.databricks.FooBarSuite | DBR 8.3 | GCP Staging
Developer-friendly failure reporting
• Enable more types of automation
  • Make critical issues stand out: automatically escalate failures that match certain criteria
  • Automatically assign the affected version
  • (Future) Automatically disable tests
• Developers can easily update test owners and the rules used to categorize test failures
Takeaways
• Building automated data pipelines to manage test results at scale
• Databricks and Delta make the work easy
• Connecting test problems with the right owners is key to making the test management process sustainable
• Curating reports for different types of personas makes it easy to process the information surfaced from CI systems
Next steps
• Building holistic views of all CI/CD activities
• Gaining more insights from CI/CD datasets to continuously guide engineering practice improvements

Join us!
https://databricks.com/careers
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.