This presentation describes how Jenkins and the Load Sharing Facility (LSF) were used to enable rapid delivery of device driver software. LSF was integrated with Jenkins to dynamically allocate hardware evaluation boards across software developers, testers, and Jenkins itself, which simplified resource sharing and increased automated regression testing throughput. Tests could take advantage of idle boards and run in parallel, which improved reliability and scalability and shortened testing cycles. The benefits included eliminating manual board sharing, simplifying user access to boards, increasing system reliability, and shortening release cycles through faster automated regressions.
2. Jenkins World
#JenkinsWorld
Jenkins and Load Sharing Facility (LSF) Enables
Rapid Delivery of Device Driver Software
Brian Vandegriend
Product Verification Manager, Microsemi
Twitter: @BVandegriend
3. Jenkins World
Overview
How to dynamically allocate hardware evaluation boards across software
developers/testers and Jenkins in order to simplify resource sharing and
increase automated regression throughput
Agenda:
• Our Build/Test Environment
• Old/New Methods for Allocating Boards to Users/Jenkins
• LSF Concepts and Short Overview of the Tool
• Integrating LSF with Jenkins
• 4 Tips for Increasing Reliability and Throughput
• Benefits Realized by the Team
4. Jenkins World
Our Build and Test Environment
• Our group develops, tests and supports device
driver software/firmware for Optical/Ethernet
networking SOCs with 100 Gbps ports
– Device driver is written in C (150 KLOC)
– Subversion used for revision control
• CloudBees Jenkins Platform is used to
continuously build and test the device driver
– 1 master with 2 slave nodes (VM/bare-metal)
– Production releases are shipped every 2 to 3 weeks
– Continuous Delivery: Release process is automated
except for posting to web portal (manual approval)
• Driver testing is done on lab-based boards
– Over 500 automated system-level tests that have a
runtime of 200+ board-hours
[Figure: test setup showing a Packet Generator/Monitor FPGA, packets, the SoC evaluation/test board, and an Intel-based COM Express module running Linux]
5. Jenkins World
Old Method: Manual sharing of boards
Problems with approach:
• If an engineer is not using their board, it’s hard for other engineers
to use it.
• If an engineer’s assigned board is used by someone else, they
typically hunt around trying to find a free board.
• Software regressions can’t take advantage of idle boards and
run in parallel.
[Figure: SW Lab (Vancouver) serving 4 sites and 50 engineers (groups labelled 10, 10, 10 and 20). Boards are manually assigned to engineers; regressions are statically assigned (25 jobs).]
6. Jenkins World
New Method: Use IBM’s Load Sharing Facility (LSF) tool to
dynamically allocate boards
1. User submits test to queue (FIFO)
2. Tool dispatches tests to free boards
3. Board becomes free when test finishes
4. Jenkins uses lower priority queue to make use of idle boards
[Figure labels: SW Lab, LSF, 25 Jobs]
Advantages of using a queue-based solution, such as LSF:
• Enhanced productivity − users do not have to find/reserve boards. The
system will grant a free board to the user based upon their needs.
• Higher reliability − "problematic" boards are taken offline and the system
will direct jobs to the other boards. Individual users are not impacted.
• Scalability − as users/boards are added, no re-adjustment of board
assignments is necessary.
• Shorter automated testing cycles and higher efficiency of boards
(A board can only run one test at a time.)
7. Jenkins World
Comparing 2 Solutions:
Running LSF versus using Jenkins Slaves on Boards
Running LSF on Boards
• Board resources are treated as one large
resource pool
• If a board crashes, only 1 test result is
lost
• Test balancing across jobs is not required
as tests are dynamically allocated to
boards by LSF
• LSF can allocate all boards to automated
tests when users are not using them
Using boards as Jenkins Slaves
• Boards are divided into 2 groups: 1 for
Jenkins slaves and 1 for users
• If a board crashes, all test results are
lost
• Tests need to be equally partitioned
across jobs to maximize throughput
• Jenkins can’t take advantage of free
boards that users are not using
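[Figure: LSF dispatching to a pool of LSF hosts (labelled 25), compared with the Jenkins scheduler running Job 1 through Job 10 (labelled 40 each, 400 in total) on 15 slaves, with a further group labelled 10]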
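The throughput difference between the two approaches can be illustrated with a small simulation. This is only a sketch under stated assumptions: the test durations are invented, the static case splits tests into equal-count chunks with one board per job, and the pooled case hands each test to whichever board frees up first. It does not model LSF or Jenkins themselves.

```python
import heapq

def static_makespan(durations, num_jobs):
    """Split tests into equal-count chunks, one board per job; return total time."""
    chunk = -(-len(durations) // num_jobs)  # ceiling division
    chunks = [durations[i:i + chunk] for i in range(0, len(durations), chunk)]
    return max(sum(c) for c in chunks)

def pooled_makespan(durations, num_boards):
    """Dispatch each test, in order, to whichever board becomes free first."""
    free_at = [0.0] * num_boards
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available board
        heapq.heappush(free_at, start + d)
    return max(free_at)

# Hypothetical run times in minutes: four long tests followed by 36 short ones.
durations = [60, 60, 60, 60] + [5] * 36
print("static partition:", static_makespan(durations, 10))
print("shared pool:     ", pooled_makespan(durations, 10))
```

With unevenly sized tests, the static split is limited by its slowest job, which is why the slide notes that tests would need to be balanced across jobs by hand.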
8. Jenkins World
Prerequisites for using LSF for Sharing Boards
• Boards have a version of Linux installed (CentOS, Red Hat, Fedora, and
so on) → LSF sees each board as a Linux server
• Boards are fairly homogeneous in their configuration/hardware
– Many different board types will lead to a fragmented pool
o LSF can easily handle multiple resource types and allocate jobs based on
resource requests
– For our project, we ensured we had chip fuse overrides that were
controllable through SW
• Requires users to close their debug sessions when finished to allow the
board to be allocated to the next user
• Timeouts are enforced by LSF to ensure boards are returned to the pool
• Successful adoption by the team relies on individuals to use the
system and not circumvent it by logging directly into boards
9. Jenkins World
LSF Concepts
• Each queue can enforce user limits and run times
– Short queue typically has a job limit of 1, run time of 1 hour and highest priority
– Long queue typically allows multiple jobs per user and has the lowest priority
• LSF uses a priority-based, fair-share algorithm to dispatch jobs to hosts
• Each host has a number of attributes which can be requested
[Figure: LSF cluster]
Cluster: hosts grouped by project (Proj-1, Proj-2)
Queues: short, normal, long
Resource Attributes (example hosts):
- Proj-1, atom_cpu, Num_devices=3, Greenhills, Fedora_Linux
- Proj-2, i5_cpu, Num_devices=2, Fedora_Linux
LSF daemons run on each host (sbatchd/res/lim)
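The interaction of queue priority and host attributes can be pictured with a toy dispatcher. This is a simplified model, not LSF's algorithm: fair-share, per-user limits and run-time limits are left out, the priorities are the ones shown in this deck's bqueues listing, and the job and host names are made up for illustration.

```python
from collections import deque

# Queue priorities as shown in the bqueues listing in this deck.
QUEUE_PRIORITY = {"short": 60, "normal": 25, "long": 10}

def dispatch(pending, free_hosts):
    """Assign pending jobs to free hosts, highest-priority queue first.

    pending:    dict of queue name -> deque of (job_name, required_attrs)
    free_hosts: dict of host name -> set of attributes
    Returns a list of (job_name, host) pairs; unmatched jobs stay pending.
    """
    assignments = []
    free = dict(free_hosts)
    for queue in sorted(pending, key=QUEUE_PRIORITY.get, reverse=True):
        jobs = pending[queue]
        waiting = deque()
        while jobs:
            job, required = jobs.popleft()          # FIFO within a queue
            host = next((h for h, attrs in free.items()
                         if required <= attrs), None)
            if host is None:
                waiting.append((job, required))     # no matching free board
            else:
                del free[host]                      # one test per board
                assignments.append((job, host))
        jobs.extend(waiting)
    return assignments
```

In this model a job in the short queue always gets first pick of the free boards, and Jenkins jobs in the long queue only run on whatever is left, which matches the intent described on the slides.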
10. Jenkins World
Submitting Jobs using LSF
Running an interactive command on a board through LSF:
> bsub -Ip -q short -R Proj1 echo "hello world!"
Job <623549> is submitted to queue <short>.
<<Starting on board-105>>
hello world!
Request specific resource constraints:
> bsub -R "Proj1 && i5_cpu" -R "num_devices>=2" xterm
> bsub -q long -m board-105 test_cmd    # requests a specific board
To view job status:
> bjobs
JOBID   USER     STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME
549269  vandegr  RUN   short  swbuild01  board-105  xterm
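A script that submits tests on a user's behalf needs to assemble these command lines and read back the job ID. The following is a minimal sketch using only the bsub options shown on this slide (-q, -R, -m) and the submission message format shown above; the helper names are hypothetical and are not taken from the authors' script.

```python
import re
import shlex

def build_bsub_command(command, queue=None, resources=(), host=None):
    """Build a bsub argument list from the options shown on the slide."""
    args = ["bsub"]
    if queue:
        args += ["-q", queue]
    for requirement in resources:
        args += ["-R", requirement]     # one -R per resource requirement
    if host:
        args += ["-m", host]            # pin the job to a specific board
    return args + shlex.split(command)

def parse_submission(submit_output):
    """Return (job_id, queue) from a 'Job <N> is submitted to queue <Q>.' line."""
    match = re.search(r"Job <(\d+)> is submitted to queue <([^>]+)>", submit_output)
    if match is None:
        raise ValueError("unrecognised bsub output: %r" % submit_output)
    return int(match.group(1)), match.group(2)
```

The returned list can be passed to `subprocess.run(args, capture_output=True, text=True)`, which avoids shell quoting problems with requirements such as `"Proj1 && i5_cpu"`.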
11. Jenkins World
Monitoring Jobs/Queues/Hosts using LSF
To view the list of queues and their status:
> bqueues
QUEUE_NAME  PRIO  STATUS       JL/U  NJOBS  PEND  RUN
short       60    Open:Active  1     1      0     1
normal      25    Open:Active  5     7      3     4
long        10    Open:Active  10    33     18    15
To view the host status:
> bhosts
HOST_NAME   STATUS   JL/U  MAX  NJOBS  RUN
swbuild01   ok       -     4    2      2
board-105   closed   -     1    1      1
board-106   unavail  -     1    0      0
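Because these listings are plain whitespace-separated tables, a monitoring script can read them with very little code. The sketch below parses the bhosts listing shown on this slide and picks out boards reported as unavailable; the `board-` name prefix is an assumption based on the host names in this deck, and real bhosts output carries more columns than the abbreviated listing here.

```python
def parse_table(text):
    """Parse whitespace-separated command output into a list of row dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

def unavailable_boards(rows, prefix="board-"):
    """Return hosts with the given name prefix whose STATUS is 'unavail'."""
    return [row["HOST_NAME"] for row in rows
            if row["HOST_NAME"].startswith(prefix) and row["STATUS"] == "unavail"]

# Sample copied from the bhosts listing on this slide.
SAMPLE_BHOSTS = """\
HOST_NAME   STATUS   JL/U  MAX  NJOBS  RUN
swbuild01   ok       -     4    2      2
board-105   closed   -     1    1      1
board-106   unavail  -     1    0      0
"""

print(unavailable_boards(parse_table(SAMPLE_BHOSTS)))
```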
12. Jenkins World
Boards/Jobs Administration using LSF
# Print out boards in LSF cluster:
> lshosts
HOST_NAME  model     cpuf  ncpus  maxmem  maxswp  RESOURCES
Board-73   Atom      10.0  1      1.8G    3.9G    (proj1 atom)
Board-38   Intel_i5  40.0  2      5.6G    3.9G    (proj2 i5)
# Take board offline:
> badmin hclose -C "hardware issue" board-38
# Kill/suspend/resume a job running in LSF:
> bkill/bstop/bresume <job_id>
13. Jenkins World
Integrating LSF with Jenkins for Automated Testing
1. Jenkins job starts run_regression Python script that creates test list
2. Script dispatches tests to LSF as individual jobs with resource constraints
3. Script scans LSF status and log files to print out real-time status to console log
4. Script waits until all tests are finished
5. Script converts test log files into JUnit format which is summarized by Jenkins
[Figure labels: Jenkins job (25), LSF, SW Lab]
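The final step, turning test results into a report Jenkins can summarize, can be sketched as follows. The deck does not show the authors' log format or script, so the input here is an assumed list of tuples, and the element and attribute names follow the commonly used JUnit XML layout (testsuite, testcase, failure).

```python
import xml.etree.ElementTree as ET

def results_to_junit(suite_name, results):
    """Render results as a JUnit-style XML string.

    results: list of (test_name, passed, seconds, message) tuples
             (a hypothetical in-memory form of the parsed test logs).
    """
    failures = sum(1 for _, passed, _, _ in results if not passed)
    suite = ET.Element("testsuite", name=suite_name,
                       tests=str(len(results)), failures=str(failures))
    for name, passed, seconds, message in results:
        case = ET.SubElement(suite, "testcase", classname=suite_name,
                             name=name, time="%.1f" % seconds)
        if not passed:
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")
```

Writing the returned string to a file in the job workspace lets a JUnit result publishing step in the Jenkins job pick it up.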
14. Jenkins World
Enabling User-based Automated Testing through Jenkins
• Regression script is also used in
Jenkins for users to run regression
tests against their SW/FW changes
• Users can submit their project
workspaces to Jenkins for testing
• Helps ensure that the software trunk
remains “green”
• Parameterized build is used to pass
variables to the Python script
• LSF runs user regressions in parallel
with the main (trunk) SW regressions
Making QA regressions available to all developers!
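One way a parameterized build can hand values to the Python script is through environment variables, which Jenkins sets for build steps from the build parameters. The sketch below assumes that mechanism; the parameter names (USER_WORKSPACE, LSF_QUEUE, TEST_FILTER) and defaults are hypothetical and are not taken from the deck.

```python
import os

def regression_settings(env=None):
    """Collect regression settings from build parameters exposed as env vars.

    All parameter names and defaults here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    return {
        "workspace": env.get("USER_WORKSPACE", ""),   # user's project workspace
        "queue": env.get("LSF_QUEUE", "long"),        # low-priority queue by default
        "test_filter": env.get("TEST_FILTER", "*"),   # run everything by default
    }
```

Passing a plain dict as `env` makes the function easy to exercise outside Jenkins.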
15. Jenkins World
Tip: Using LSF to prevent corrupted boards from
killing your automated regression
Problem: Boards in a bad state will cause tests to fail in rapid succession
Solution: Automatically take boards offline that rapidly fail tests:
• LSF monitors the job exit rate for boards, and closes the board if the rate
exceeds a configurable threshold
– For example, LSF takes a board offline if 5 tests exit abnormally in less than 30 sec
• Run a board initialization script prior to running the test
– LSF executes a pre-execution script that puts the board into a known good state
– If the board initialization script returns an error code, LSF takes the board offline
and finds a new board for the test
• Re-queue tests that fail because of a board issue
– LSF will re-run a test if it exits with a special error code (such as 99)
This technique has increased overall reliability of the system!
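The exit-rate rule and the re-queue rule can be modelled in a few lines. LSF applies these rules itself through its configuration, so the class below is only an illustration of the behaviour using the example numbers from this slide (5 abnormal exits in under 30 seconds, exit code 99); it is not how LSF implements them.

```python
from collections import deque

REQUEUE_EXIT_CODE = 99  # example value from the slide

class ExitRateMonitor:
    """Close a board after `limit` abnormal exits within `window` seconds."""

    def __init__(self, limit=5, window=30.0):
        self.limit = limit
        self.window = window
        self.failures = {}   # board -> deque of abnormal-exit timestamps
        self.closed = set()  # boards taken offline

    def record_exit(self, board, exit_code, now):
        """Record one finished job; return 'ok', 'failed' or 'requeue'."""
        if exit_code == 0:
            return "ok"
        times = self.failures.setdefault(board, deque())
        times.append(now)
        # Forget failures that are no longer inside the time window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.limit:
            self.closed.add(board)
        return "requeue" if exit_code == REQUEUE_EXIT_CODE else "failed"
```

A board that fails five tests five seconds apart is closed, while one that fails a test every ten seconds is not, so an occasional genuine test failure does not take hardware out of the pool.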
16. Jenkins World
Tip: Automatically Reboot/Power-cycle Bad Boards
Problem:
• Boards taken offline by LSF need to be quickly brought back online to maximize
test throughput
Solution:
• A board monitoring script, which is run periodically by Jenkins, will reset/
power-cycle offline boards and bring them back online
– If a soft reset fails, then an Ethernet-based remote power switch (from
Digital Loggers) is used to power-cycle the board
– If power-cycling fails, the Jenkins job fails with an error and e-mail notification is
sent to the administrator
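The escalation described above (soft reset first, power cycle second, fail the job last) can be sketched as a small function. The three callables are hypothetical hooks standing in for the site's reset command, the remote power switch, and a health check; none of them is shown in the deck.

```python
def recover_board(board, soft_reset, power_cycle, is_healthy):
    """Try a soft reset, then a power cycle; return the step that worked.

    soft_reset, power_cycle and is_healthy are caller-supplied callables
    (hypothetical hooks, each taking the board name).
    Raises RuntimeError if the board is still unhealthy afterwards, so that
    a Jenkins job running this script fails and notifies the administrator.
    """
    for step_name, action in (("soft_reset", soft_reset),
                              ("power_cycle", power_cycle)):
        action(board)
        if is_healthy(board):
            return step_name
    raise RuntimeError("board %s did not recover after power cycle" % board)
```

Letting the exception propagate gives a non-zero exit status, which is what turns an unrecoverable board into a failed Jenkins build.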
17. Jenkins World
Tip: Qualify Boards Prior to using Them in Automated
Regression Testing
Problem:
• A small percentage of our boards will have manufacturing defects (primarily
due to the SoC devices being removed/re-soldered)
• Boards with hardware faults will cause a small number of tests to fail
intermittently/consistently, which takes a lot of time to track down
Solution:
• As boards are added, run a battery of working tests with a stable SW release on
each board with 5 to 10 iterations
• Weed out problematic/failing boards and use the rest for automated testing
– Use LSF groups to create a group of “golden” boards
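The qualification pass amounts to a strict filter: a board joins the golden group only if every known-good test passes on every iteration. A minimal sketch, assuming a caller-supplied `run_test` hook (hypothetical) that runs one test from a stable release on one board and reports pass or fail:

```python
def qualify_boards(boards, tests, run_test, iterations=5):
    """Split boards into (golden, rejected) lists.

    run_test(board, test) -> bool is a caller-supplied hook (hypothetical).
    A single failure in any iteration rejects the board, since intermittent
    hardware faults are exactly what this pass is meant to catch.
    """
    golden, rejected = [], []
    for board in boards:
        passed = all(run_test(board, test)
                     for _ in range(iterations)
                     for test in tests)
        (golden if passed else rejected).append(board)
    return golden, rejected
```

The golden list is what would then be placed in an LSF group so that automated regressions only land on qualified boards.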
18. Jenkins World
Tip: Handling Flaky Tests
Problem:
• Flaky tests (those that can fail or pass with the same software code) make it
difficult for automated regressions to be “green” (no failures)
• Our system suffers from flaky tests, similar to what is experienced by Google
– Google Test blog article: “We see a continual rate of about 1.5% of all test runs
reporting a "flaky" result. There are many root causes why tests return flaky
results, including concurrency, relying on non-deterministic or undefined behaviors,
flaky third party code, infrastructure problems, etc.”
Workaround:
• Regression script re-runs test failures 3 times and if all re-runs have passed,
then the test result is changed from “failed” to “skipped”
– Test result is only modified if the test has been tagged as a flaky test
– This method also helps to quickly identify new test failures as consistent or
intermittent failures.
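The re-run rule can be written down directly. In this sketch every failed test is re-run three times, the result is downgraded to "skipped" only when all re-runs pass and the test carries the flaky tag, and the individual re-run outcomes are returned so that a new failure can be labelled consistent or intermittent. The `rerun` hook and the set of tagged tests are assumptions for illustration.

```python
def classify_failure(test_name, rerun, flaky_tests, reruns=3):
    """Decide the final result for a test that failed its first run.

    rerun(test_name) -> bool is a caller-supplied hook (hypothetical).
    flaky_tests is the set of test names tagged as flaky.
    Returns (result, outcomes) where result is 'skipped' or 'failed'.
    """
    outcomes = [rerun(test_name) for _ in range(reruns)]
    if all(outcomes) and test_name in flaky_tests:
        return "skipped", outcomes
    return "failed", outcomes

def failure_kind(outcomes):
    """Label a failure from its re-run outcomes."""
    return "intermittent" if any(outcomes) else "consistent"
```

An untagged test that passes all of its re-runs still reports "failed", which keeps a newly flaky test visible until someone decides to tag it.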
19. Jenkins World
Tip: Handling Flaky Tests (continued)
• Flaky tests in Jenkins can be represented by using the “Skip” column
– It doesn’t seem possible to have Jenkins recognize new values for the TEST_RESULT
property: <property name="TEST_RESULT" value="SKIP"/>
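The deck does not show how its script marks a test as skipped in the report. A common way to land a result in the "Skip" column is the JUnit XML convention of a `<skipped>` child element inside the test case, sketched here with assumed suite and test names:

```python
import xml.etree.ElementTree as ET

def skipped_testcase(classname, name, reason):
    """Build a JUnit-style <testcase> element carrying a <skipped> child."""
    case = ET.Element("testcase", classname=classname, name=name)
    ET.SubElement(case, "skipped", message=reason)
    return case

# Illustrative names only; not taken from the deck.
case = skipped_testcase("driver_regression", "loopback_100g",
                        "flaky: passed 3 of 3 re-runs")
print(ET.tostring(case, encoding="unicode"))
```

Putting the reason in the `message` attribute keeps the flaky classification visible in the report even though the result itself reads as skipped.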
20. Jenkins World
Benefits realized by adopting LSF
• Our solution using LSF and Jenkins for Automated Regression
testing has:
– Eliminated manual sharing of boards, which was tedious and inefficient
– Simplified users’ experience of finding a free board
– Increased reliability of the system through automatic power-cycling
of corrupted boards and running a board initialization script
– Improved quality of code commits as users run automated regression
tests on their changes through Jenkins/LSF before committing
– Shortened release cycles as automated regressions complete faster
by gaining access to more boards