
Managing Millions of Tests Using Databricks

Databricks Runtime is the execution environment that powers millions of VMs running data engineering and machine learning workloads daily in Databricks. Inside Databricks, we run millions of tests per day to ensure the quality of different versions of Databricks Runtime. Because of the large number of tests executed daily, we have continuously faced the challenge of effective test result monitoring and problem triaging. In this talk, I share our experience building an automated test monitoring and reporting system using Databricks. I will cover how we ingest data from sources such as CI systems and Bazel build metadata into Delta, and how we analyze test results and report failures to their owners through Jira. I will also show how this system empowers us to build different types of reports that effectively track the quality of changes made to Databricks Runtime.

  1. Managing millions of tests using Databricks
     Yin Huai, Databricks
  2. Who am I?
     • Yin Huai: Staff Software Engineer, Databricks
     • Databricks Runtime group: focusing on designing and building the Databricks Runtime container environment and its associated testing and release infrastructures
     • Apache Spark PMC member
  3. Global-scale & multi-cloud data platform
     Want to learn more about our experience building and scaling Databricks’ unified analytics platform? Check out "Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks" from Jeff Pang.
  4. Data platform: deep technical stack
     [Architecture diagram: multiple customer networks served by a sharded control plane (CM masters, workers, API servers) built on Kubernetes, Envoy, and GraphQL, with HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, and billing, all running on cloud VMs, network, storage, and databases]
     Want to learn more about our experience building and scaling Databricks’ unified analytics platform? Check out "Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks" from Jeff Pang.
  5. Wide surface area
     [Diagram: many customer networks, each with a data lake (CSV, JSON, TXT, ...) and Kinesis, connected to a control plane offering collaborative notebooks, AI, streaming, analytics, workflow scheduling, cluster management, admin & security, and reporting/business insights]
     Want to learn more about our experience building and scaling Databricks’ unified analytics platform? Check out "Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks" from Jeff Pang.
  6. Large scale of customer workloads
     Millions of Databricks Runtime clusters managed per day
  7. Testing, testing, testing
     • In-house CI system (to replace Jenkins) to execute tests at scale
     • GitHub webhook receiver/consumer to dispatch CI jobs at scale
     ~2.8 test-years/day, ~54 million tests/day, ~630 tests/sec
  8. Handle test results at scale?
     • Tests fail every day
     • If a test fails once out of every 1,000,000 runs: 54 failures/day
     • If a test fails once out of every 100,000 runs: 540 failures/day
     • ...
     • How do we keep up?
     ~2.8 test-years/day, ~54 million tests/day, ~630 tests/sec
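     (These headline numbers are mutually consistent: 54,000,000 tests per day over 86,400 seconds is roughly 625 tests/sec, matching the ~630 tests/sec figure, and ~2.8 test-years of execution time spread over ~54 million tests implies an average test duration of roughly 1.6 seconds.)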
  9. Build a system that automatically triages test failures to the right owners in a developer-friendly form
  10. Guiding principles
      • Automated: test failures are collected and reported without any manual intervention
      • Connecting the problem with the right owner: the system can decide who should receive a report
      • Developer-friendly failure reporting: build reporting around our Jira-centric development workflow, curate reports with the appropriate level of detail, and empower users to correct failure attribution
  11. In the rest of this talk...
      • The data problem to solve: how to approach the problem and build a solution
      • System overview: how to get everything implemented
  12. The data problem to solve
  13. What is the actual problem?
      [Diagram: a "???" connecting the in-house CI system, Jenkins, code repositories with Bazel as the build tool, and Jira]
  14. Building data pipelines
      [Pipeline diagram: collect test results from the in-house CI system and from Jenkins; collect the test-to-owner mapping from code repositories with Bazel as the build tool; report test failures to Jira]
  15. System overview
  16. Use the right tools for solving the problem
      • Hosting data pipelines: take advantage of the unified analytics platform
      • Loading CI systems’ results and Bazel build metadata: Apache Spark’s data source APIs
      • Storing datasets: Delta makes continuous data ingestion simple
  17. Establishing test results tables
      [Same pipeline diagram as slide 14, highlighting the "collect test results" steps]
  18. Establishing test results tables
      • In-house CI system: Spark JDBC connector
      • Jenkins: Spark Jenkins connector

          val df = spark
            .read
            .format("com.databricks.sql.jenkins.JenkinsSource")
            .option("host", ...)
            .option("username", ...)
            .option("passwordOrToken", ...)
            .option("table", "tests")        // one of "jobs", "builds", or "tests"
            .option("builds.fetchLimit", 25) // optional
            .load()

      The connector supports jobs, builds, and tests views:
      • Jobs view: query available jobs
      • Builds view: query build statuses
      • Tests view: query detailed test results of selected builds (error messages, stack traces, ...) exposed by the JUnit plugin
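      As a usage sketch, a downstream query can then filter the tests view for failures; the column names here (status, suiteName, errorMessage) are illustrative assumptions, not the connector's documented schema:

          import org.apache.spark.sql.functions.col

          // Hypothetical downstream query over the tests view; column names are assumed.
          val failedTests = df
            .filter(col("status") === "FAILED")
            .select(col("suiteName"), col("errorMessage"))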
  19. Establishing test results tables
      • Delta makes building the continuous data ingestion pipeline easy:
        - Only ingest new results from CI systems, using MERGE INTO
        - Ingest results from different Jenkins jobs in parallel into the same destination table
        - Roll back to a recent version in case there is a bad write, with Delta Time Travel
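      A minimal sketch of the "only ingest new results" upsert with the Delta Lake Scala API; the table and column names (ci.test_results, build_id, test_name) are illustrative assumptions:

          import io.delta.tables.DeltaTable

          val target = DeltaTable.forName(spark, "ci.test_results")
          target.as("t")
            .merge(newResults.as("s"), // newResults: a DataFrame of freshly fetched CI results
              "t.build_id = s.build_id AND t.test_name = s.test_name")
            .whenNotMatched().insertAll() // insert only rows not already ingested
            .execute()

          // Rolling back a bad write with Delta Time Travel:
          // spark.sql("RESTORE TABLE ci.test_results TO VERSION AS OF 123")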
  20. Establishing test owners table
      [Same pipeline diagram as slide 14, highlighting the "collect test to owner mapping" step]
  21. Establishing test owners table
      • Bazel can output structured build metadata (in XML) for every build target: bazel query --output=xml
      • Bazel build targets can have user-specified metadata, e.g. owners

          <?xml version="1.1" encoding="UTF-8" standalone="no"?>
          <query version="2">
            <rule class="generic_scala_test" location="..." name="//foo/bar:MyTest">
              <string name="name" value="MyTest"/>
              <list name="visibility">
                <label value="//visibility:public"/>
              </list>
              <list name="tags">
                <string value="MyTag"/>
              </list>
              <string name="generator_name" value="MyTest"/>
              ...
              <string name="size" value="medium"/>
              <string name="scala_version" value="2.12"/>
              <list name="suites">
                <string value="com.databricks.MyTest"/>
              </list>
              <list name="owners">
                <string value="spark-env"/>
              </list>
              <list name="sys_props">
                <string value="log4j.debug=true"/>
                <string value="log4j.configuration=log4j.properties"/>
              </list>
              ...
            </rule>
          </query>
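      A minimal sketch of turning that XML into (suite, owner) rows; the bazel query expression is an assumption, and the list names mirror the sample above:

          import scala.sys.process._
          import scala.xml.XML

          // Run bazel query and parse its XML output (assumes the bazel CLI is on PATH).
          val xmlOut = Seq("bazel", "query", "kind(generic_scala_test, //...)", "--output=xml").!!
          val rules = XML.loadString(xmlOut) \ "rule"

          val suiteOwners = rules.flatMap { rule =>
            // Collect the string values of a named <list> element under this rule.
            def strings(listName: String) = (rule \ "list")
              .filter(l => (l \ "@name").text == listName)
              .flatMap(_ \ "string")
              .map(s => (s \ "@value").text)
            for (suite <- strings("suites"); owner <- strings("owners")) yield (suite, owner)
          }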
  22. Establishing test owners table
      • The test owners table includes:
        - Test suite name (the suite name appearing in JUnit test reports)
        - The corresponding Jira component of the owner
      • More fields provided by Bazel can be easily added
      Pipeline: check out repositories → query Bazel → parse XML records → insert/update the Delta table
  23. Reporting test failures to Jira
      [Same pipeline diagram as slide 14, highlighting the "report test failures" step]
  24. Test reporting pipeline
      [Diagram: a failure detector reads the test results and test owners tables, ignoring already-reported failures; a failure analyzer categorizes the failures; a failure reporter files test failure reports, with logs, to Jira]
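      A minimal sketch of the detector's "ignore reported failures" step as a left anti-join; the table and column names are illustrative assumptions:

          // Failures that have not been reported yet (assumed tables and columns).
          val failures = spark.table("ci.test_results").where("status = 'FAILED'")
          val reported = spark.table("ci.reported_failures")
          val newFailures = failures.join(reported, Seq("build_id", "test_name"), "left_anti")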
  25. Connecting the problem with the right owner
      • The test owner is not necessarily the owner of the failure
      • Types of test failures:
        - Type 1. The testing environment has a problem: the owner of that problem should own the failure (e.g., cloud provider errors and staging service incidents)
        - Type 2. Failed because another test failed: noise, so no owner needs to be assigned to this failure. This type represents test isolation problems, which should be eliminated.
        - Type 3. Other causes: the owner of the test should own the failure
      [Diagram: the failure analyzer splits failed tests into type 1, type 2, and type 3 failures before the failure reporter]
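      A minimal sketch of rule-based categorization; the error-message patterns are illustrative assumptions, not the actual production rules:

          import org.apache.spark.sql.functions.{col, when}

          // Assign a failure type by pattern-matching the (assumed) error_message column.
          val categorized = newFailures.withColumn("failure_type",
            when(col("error_message").rlike("QuotaExceeded|CloudProviderError"), "type1_environment")
              .when(col("error_message").rlike("aborted because a previous test failed"), "type2_noise")
              .otherwise("type3_test_owner"))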
  26. Developer-friendly failure reporting
      • Two critical use cases to support:
        - Understand the unique problems associated with given teams for a given time window
        - Understand exactly how a test is failing in a given testing environment
      • Two-layer reporting structure:
        - Parent Jira ticket: represents a unique problem, e.g., a test suite or a cloud provider error
        - Subtask: represents individual failures happening in a specific testing environment, e.g., all failures of a given test suite in the AWS staging workspace associated with Databricks Runtime 8.1
      • A new failure finds the right open parent ticket and subtask, and then adds a new comment (see the sketch below)
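      A minimal sketch of that lookup flow; JiraClient and its methods are hypothetical stand-ins for Jira REST API calls, not a real client library:

          case class Failure(suite: String, env: String, details: String)

          trait JiraClient {
            def findOrCreateParent(problemKey: String): String // returns the parent ticket key
            def findOrCreateSubtask(parentKey: String, envKey: String): String
            def addComment(ticketKey: String, body: String): Unit
          }

          def report(jira: JiraClient, f: Failure): Unit = {
            val parent  = jira.findOrCreateParent(f.suite)        // one parent per unique problem
            val subtask = jira.findOrCreateSubtask(parent, f.env) // e.g. "DBR 8.1 | AWS Staging"
            jira.addComment(subtask, f.details)                   // each new occurrence becomes a comment
          }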
  27. Developer-friendly failure reporting
      Example ticket 1 (type 1):
        Parent: Cloud Provider Error | VM Quota Exceeded
        Subtasks: com.databricks.Suite1 | DBR 8.1 | AWS Staging; com.databricks.Suite2 | DBR 8.2 | Azure Staging; com.databricks.Suite3 | DBR 8.3 | GCP Staging
      Example ticket 2 (type 3):
        Parent: com.databricks.FooBarSuite
        Subtasks: com.databricks.FooBarSuite | DBR 8.1 | AWS Staging; com.databricks.FooBarSuite | DBR 8.2 | Azure Staging; com.databricks.FooBarSuite | DBR 8.3 | GCP Staging
  28. Developer-friendly failure reporting
      • Enable more types of automation:
        - Make critical issues stand out: automatically escalate failures that match certain criteria
        - Automatically assign the affected version
        - (Future) Automatically disable tests
      • Developers can easily update test owners and the rules used to categorize test failures
  29. Takeaways
      • Build automated data pipelines to manage test results at scale; Databricks and Delta make the work easy
      • Connecting test problems with the right owners is key to making the test management process sustainable
      • Curating reports for different types of personas makes it easy to process the information surfaced from CI systems
  30. Next steps
      • Building holistic views of all CI/CD activities
      • Gaining more insights from CI/CD datasets to continuously guide engineering practice improvements
      Join us! https://databricks.com/careers
  31. Feedback
      Your feedback is important to us. Don’t forget to rate and review the sessions.
