Flaky tests and bugs in Apache software (e.g. Hadoop)

Copyright© 2016 NTT Corp. All Rights Reserved.
Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
NTT Software Innovation Center
ApacheCon Core North America (May 12, 2016, at Vancouver)

• Software Engineer at NTT Corporation
• NTT: the largest telecom in Japan
• Working on improving the reliability of
distributed systems
• Some contributions to ZooKeeper / Hadoop
including critical bug fixes (non-committer)
• github: https://github.com/AkihiroSuda
Who am I

• Current "flakiness" in Apache software
• Why do flaky tests matter?
• What causes a flaky test?
• How can we find, reproduce, and fix a flaky test?
• Existing work in Apache communities
• Our work: Namazu (鯰, catfish)
https://github.com/osrg/namazu
Agenda

Good News: Apache software is well tested!
Software    Production code (LOC)  Test code (LOC)
MapReduce   95K                    87K
YARN        178K                   121K
HDFS        152K                   150K
ZooKeeper   33K                    27K
HBase       571K                   222K
Spark       167K                   128K
Flume       46K                    34K
Cassandra   168K                   78K
Measured on 14/01/2016 using CLOC

Bad News: https://builds.apache.org/job/%s-trunk/
[Figure: Jenkins build history for MapReduce, YARN, HDFS, ZooKeeper, and HBase (axes: build, build time; blue = success, red = failure). Captured on 14/01/2016.]
I've never seen a fully successful Hadoop build,
even on my local machine...

Bad News: JIRA QL: project = ? AND text ~ "test fail*"
Software    #Matched issues  #All issues
MapReduce   2,441 (38%)      6,373
YARN        2,290 (63%)      4,756
HDFS        5,141 (53%)      9,672
ZooKeeper   828 (35%)        2,384
HBase       6,595 (42%)      15,542
Spark       794 ( 6%)        14,047
Flume       342 (12%)        2,882
Cassandra   1,656 (15%)      11,430
(The numbers are only a rough approximation.)
Roughly speaking, half of Hadoop development
is dedicated to debugging test failures.
Interestingly, flakiness does not seem to be uniform
across projects (discussed later).

97% of unit test failures in Apache software are reported to be
harmless for production ("false alarms")
• Information source:
"An Empirical Study of Bugs in Test Code" (A.Vahabzadeh et al., ICSME'15)
Not all test failures are critical for production..

It still matters!
For developers..
It's a barrier to promotion of CI
• If many tests are flaky, developers tend to ignore CI
failures → real bugs get overlooked
It's also a psychological barrier to contribution
• A developer may be blamed for a test failure
For users..
It's a barrier to risk assessment for production
• No one can tell flaky tests from real bugs
So flaky tests don't matter, since they don't affect production?

SemaphoreCI suggests a "no broken windows" strategy
for flaky tests
https://semaphoreci.com/community/tutorials/how-to-deal-with-and-eliminate-flaky-tests
So flaky tests don't matter, since they don't affect production?
image: http://guides.lib.jjay.cuny.edu/nypd/brokenwindows

• A typical flaky test is caused by a poorly synchronized async
operation like this
(A.Vahabzadeh et al., ICSME'15 / Q.Luo et al., ACM FSE'14 / YARN-4478)
• Basically, it can be fixed by increasing the timeout and retries
• But it's not easy to find a reasonable timeout value
(e.g. YARN-{4804, 4807, 4929...})
• A long timeout is expensive
Basic cause: async operation
invokeAsyncOperation();
// some tests lack even this sleep
sleep(certainHardcodedTimeout);
assertTrue(checkSomethingGoodHasHappened());

• Host configuration
• Host performance
• Docker is great! But it still has some
issues
Testbed (e.g. CI) can cause test failures as well

• HADOOP-12687
• Many YARN tests fail when /etc/hosts has multiple loopback
entries
• ZOOKEEPER-2252
• Test: nslookup("a") should fail
• It does not fail when a host named "a" actually exists
• INFRA-11811
• The JDK was not set up properly on a Jenkins slave
• Such a test fails only when the job is assigned to a
specific build machine, so it looks like a flaky test
CI host configuration can cause test failures

CI host performance: not all hosts are created equal
• Hadoop's buildbot https://builds.apache.org/computer/
• Spark's buildbot https://amplab.cs.berkeley.edu/jenkins/computer/
• Significant difference in the response time!
Target  Average  Max     Min
Hadoop  1163ms   1482ms  30ms
Spark   3ms      6ms     0ms
• This may be related to the fact that Spark has only a
small number of test-related issues
(e.g. YARN 63% vs Spark 6% in the JIRA query results)

Docker is great for testing!
• Some Apache projects use Docker in their
CI (via Apache Yetus)
• Apache BigTop also utilizes Docker for
provisioning Hadoop
• People also love Docker for setting up test beds
on their workstations and laptops
• Me too, of course
Docker issues

• Mentioned in several Apache-related issue tickets:
• jupyter/docker-stacks#75: Spark hanging
• docker-library/cassandra#43, #46
• docker-solr/docker-solr#4
• ALLURA-8039
• AMBARI-14706
• IGNITE-2377
• YETUS-229 …
• Fortunately, Apache Buildbot (Yetus) didn't hit the bug,
but it made people's local testbeds flaky in weird ways.
• Fixed in recent kernels (so, strictly speaking, it's not a Docker issue)
Docker #18180: Java VM unkillable zombie

AUFS: fcntl(F_SETFL, O_APPEND) was not supported
(#20199)
• Can cause data corruption (Dovecot is known to be affected)
• Fixed in recent AUFS
Overlay: you should not open the same file with O_RDWR and
O_RDONLY simultaneously (#10180)
• Can cause data corruption (RPM is known to be affected)
• Expected behavior, won't get fixed
More information: https://github.com/AkihiroSuda/docker-issues
Other potential Docker-related issues

• Some issues occur only in a
deployed environment, not in CI
• e.g. TCP packet corruption
• Very flaky and critical
Flaky tests are not limited to xUnit in CI..

https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/
• TCP checksum was ignored in some IPsec configurations
• ZooKeeper intermittently misbehaved due to corrupted TCP packets
https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.gq8chzply
• TCP checksum was ignored in some veth configurations
• Mesos and Kubernetes are affected
TCP packet corruption

• It's very hard to notice (and reproduce) flaky TCP
packet corruption...
• Should distributed systems be TCP-corruption
tolerant...?
• the probability is very low in regular environments,
but it is not zero
(32-bit Ethernet CRC + 16-bit TCP checksum)
• JIRA issues: ZOOKEEPER-2175, HDFS-8161…
TCP packet corruption
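
Why the probability is not zero: the 16-bit TCP checksum is a one's-complement sum of 16-bit words, so it cannot detect some corruption patterns, such as two aligned words changing places. The sketch below implements the RFC 1071 sum over a payload only (a real TCP checksum also covers the TCP header and a pseudo-header); the sample bytes are made up for illustration.

```go
package main

import "fmt"

// internetChecksum computes the 16-bit one's-complement checksum described
// in RFC 1071, here over a raw payload only.
func internetChecksum(data []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(data[i])<<8 | uint32(data[i+1])
	}
	if len(data)%2 == 1 {
		// Pad the trailing odd byte with a zero byte.
		sum += uint32(data[len(data)-1]) << 8
	}
	// Fold the carries back into the low 16 bits (end-around carry).
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

func main() {
	original := []byte("ABCDEFGH")
	// Corruption that swaps two aligned 16-bit words: "AB" <-> "EF".
	corrupted := []byte("EFCDABGH")
	// Addition is commutative, so both payloads produce the same checksum.
	fmt.Println(internetChecksum(original) == internetChecksum(corrupted)) // true
}
```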

• determine-flaky-tests-hadoop.py
• Apache Kudu's CI (dist_test)
• Google's TAP
• Our work: Namazu
https://github.com/osrg/namazu
• and similar great tools
Efforts to find/reproduce a flaky test

• Picks up failed tests using Jenkins API
• Included in hadoop.git/dev-support (HADOOP-11045)
determine-flaky-tests-hadoop.py
$ determine-flaky-tests-hadoop.py --job Hadoop-YARN-trunk
****Recently FAILED builds in url:
https://builds.apache.org/job/Hadoop-YARN-trunk
...
Among 15 runs examined, all failed tests <#failedRuns: testName>:
7: TestContainerManagerRecovery.testApplicationRecovery
...

• Great tool, but it doesn't support running a
specific test repeatedly
• There is also a Maven dependency issue (YARN-4478)
• B depends on A
• TestB is never executed if TestA fails
→ If TestA is flaky, we can't evaluate the flakiness of
TestB!
determine-flaky-tests-hadoop.py

Kudu's CI: flaky test dashboard
http://dist-test.cloudera.org:8080/ (Apr 25)
Recently open-sourced and introduced at Apache: Big Data (Monday)
https://github.com/cloudera/dist_test

Kudu's CI: flaky test dashboard
• Tests are run repeatedly on CI to find flaky tests
• KUDU_FLAKY_TEST_ATTEMPTS
• KUDU_FLAKY_TEST_LIST
(From https://github.com/apache/incubator-kudu/commit/1a24338a)
Fix flakiness of client_failover-itest
The reason this test was flaky is that there is a race between..
..
Looped 100x and they all passed:
http://dist-test.cloudera.org/job?job_id=mpercy.1454486819.10566
Author Mike Percy Jan 29, 2016 8:01 AM
Committer Todd Lipcon Feb 4, 2016 2:14 PM
Commit 1a24338ad60a8842d1ae5e227f8f03e58faea8c0

• Google's internal CI
• 1.6M test failures per day
• 73K (4.5%) are flaky
• Repeat a failing test 10 times for labeling
flaky tests
• Information source: An Empirical Analysis
of Flaky Tests (Q.Luo et al. ACM FSE'14)
Google's TAP
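
The rerun-based labeling can be sketched as follows. The slide only says that a failing test is repeated 10 times; the decision rule used here (any passing rerun means "flaky") is an assumption, and `label` and `Verdict` are hypothetical names.

```go
package main

import "fmt"

// Verdict is the label assigned to a test that has already failed once.
type Verdict string

const (
	Flaky   Verdict = "flaky"
	Failing Verdict = "consistently failing"
)

// label reruns a failed test up to reruns times. Assumed rule: if any rerun
// passes, the test is flaky; otherwise it is treated as a real failure.
func label(run func() bool, reruns int) Verdict {
	for i := 0; i < reruns; i++ {
		if run() {
			return Flaky
		}
	}
	return Failing
}

func main() {
	attempt := 0
	// Stand-in test that passes on every third attempt.
	sometimes := func() bool { attempt++; return attempt%3 == 0 }
	never := func() bool { return false }
	fmt.Println(label(sometimes, 10)) // flaky
	fmt.Println(label(never, 10))     // consistently failing
}
```

A test that fails in all 10 reruns may still be flaky with a low pass rate, which is one reason plain repetition can mislabel tests.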

• Modern CIs run jobs repeatedly to find /
reproduce flaky tests
• But they don't control non-determinism
• → They overlook flaky tests
• → They cannot reproduce a failure
→ They cannot analyze the failure
• Our suggestion: increase non-determinism
for finding and reproducing flaky tests
Challenge: poor non-determinism

NAMAZU: PROGRAMMABLE FUZZY SCHEDULER
NOTE: Namazu was formerly named "Earthquake"

Namazu: programmable fuzzy scheduler
• Fuzzes (randomizes) the schedule of events
• Increases non-determinism for finding and
reproducing flaky tests
• Event sources: filesystem, packets, Java, Go [planned], Linux threads
• 鯰 (namazu) means catfish in Japanese

• Filesystem: FUSE
• Packet: Netfilter, OpenFlow
• Java: Byteman, AspectJ
• Go: AspectGo [wip]
• Linux threads: sched_setattr(2)
Namazu uses non-invasive techniques
• can be easily applied to any environment
• can avoid false-positives
Namazu: programmable fuzzy scheduler
https://github.com/AkihiroSuda/golang-exp-aspectgo

• xUnit tests
• 😃 Easy to get started; just run `mvn`
• 😃 Can reproduce test failures observed in CI
• 😞 Limited testable scope
• Integration tests on a distributed cluster
• 😃 Can test everything
• 😞 Need to write a script to set up the cluster
• But Docker helps us a lot!
Namazu targets

We support both scenarios
• Single-node mode (for xUnit tests): $ mvn test
• Distributed mode (for integration tests): coordinated by an orchestrator via RPC
Namazu targets

NAMAZU + XUNIT TESTS
$ mvn test

• Namazu is a comprehensive framework...
• Quick start: "renice" threads for xUnit tests
• POSIX.1 requires that all threads in a process share a single nice (priority)
value, but the actual Linux implementation (NPTL) does not follow this.
• Not always effective, but it's generic and easy to get started
Namazu + xUnit tests

Namazu + xUnit tests
# Terminal 1: start the Hadoop build container
$ cd hadoop; ./start-build-env.sh
# Terminal 2: attach Namazu to the processes in the container
$ PID=$(docker inspect $(docker ps -q -f ancestor=hadoop-build-ubuntu) | jq '.[0].State.Pid')
$ sudo nmz inspectors proc -pid $PID
# Terminal 1: run the test inside the container
[container]$ mvn test -Dtest=TestFoo#testBar
Namazu periodically sets random nice values for all the child
processes and the threads under $PID
It also uses non-default kernel scheduling policies (e.g. SCHED_BATCH)

Namazu + xUnit tests: Reproducibility
Testcase                                   Traditional  Namazu
YARN-4548 (RM/TestCapacityScheduler)       11%          82%
YARN-4556 (RM/TestFifoScheduler)           2%           44%
ZOOKEEPER-2137 (ReconfigTest)              2%           16%
YARN-4168 (NM/TestLogAggregationService)   1%           8%
YARN-1978 (NM/TestLogAggregationService)   0%           4%
YARN-4543 (NM/TestNodeStatusUpdater)       0%           1%
• More information: osrg/namazu#125

Namazu + xUnit tests: Reproducibility
Testcase                               Traditional  Namazu
ZOOKEEPER-2080 (ReconfigRecoveryTest)  14.0%        61.9%
• "Renicing" is not always effective...
• But even when renicing is ineffective,
you can sometimes reproduce the flaky test
by injecting delays or reordering packets
$ sudo iptables ... -j NFQUEUE --queue-num 42
$ sudo nmz inspectors ethernet -nfq-number 42

NAMAZU + INTEGRATION TESTS

• ZooKeeper: distributed coordination service
• used in Hadoop, Spark, Mesos, Kafka..
• ZooKeeper 3.5 (alpha) introduced dynamic
reconfiguration
• We performed an integration test to evaluate
the reliability of the reconfiguration
• We found a flaky bug!
Namazu + Integration tests

• We randomly reordered specific Ethernet packets
using Namazu
• TCP retransmissions are filtered out to reduce the possible state
space
Namazu + Integration tests
ZooKeeper cluster
Open vSwitch + Ryu SDN Framework
+ Namazu

• Bug: a new node cannot join the ZK cluster properly
• The new node cannot become a leader of the ZK cluster itself
(more technically, it stays an "observer")
• Cause: distributed race (ZAB packet vs FLE packet)
• ZAB: atomic broadcast protocol for data
• FLE: leader election protocol for the ZK cluster itself
Found ZOOKEEPER-2212
[Diagram: leader of ZK cluster <-> new ZK node. ZAB (2888/tcp) and FLE (3888/tcp) use different TCP connections, so the packet order is non-deterministic]

• Expected: a ZK cluster of N nodes keeps working as long as
fewer than N/2 nodes have crashed
• Actual: a single node failure can bring down the 3-node
ensemble
Not participating properly
(keeps being an "observer")
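
ZooKeeper relies on a majority quorum, so the expected fault tolerance is simple arithmetic. A small sketch (the function names are illustrative, not ZooKeeper API):

```go
package main

import "fmt"

// quorumSize is the smallest majority of an n-node ensemble.
func quorumSize(n int) int { return n/2 + 1 }

// toleratedFailures is how many nodes may crash while a majority survives.
func toleratedFailures(n int) int { return n - quorumSize(n) }

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("N=%d quorum=%d tolerated failures=%d\n",
			n, quorumSize(n), toleratedFailures(n))
	}
	// N=3 quorum=2 tolerated failures=1
	// N=4 quorum=3 tolerated failures=1
	// N=5 quorum=3 tolerated failures=2
}
```

A 3-node ensemble should therefore survive one crash. In the bug above, a node that never participates properly does not count toward the majority, so one crash can already break the quorum.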

• Reproducibility: 0.0% → 21.8%
(tested 1,000 times)
• We could not reproduce the bug even after
5,000 runs of traditional testing (60 hours!)
• It is also reproducible just by "renicing" threads, but the
reproducibility is only 0.7%
How hard is it to reproduce?

We define the distributed execution pattern based on code coverage:
\[
P = \begin{pmatrix}
p_{1,1} & \cdots & p_{1,N} \\
\vdots & \ddots & \vdots \\
p_{L,1} & \cdots & p_{L,N}
\end{pmatrix}
\]
• L: number of lines of code (LOC)
• N: number of nodes (N = 3 in this case)
• p_{i,j}: 1 if node j covers the branch in line i, otherwise 0
• We used JaCoCo: Java Code Coverage Library (patch: ZOOKEEPER-2266)
Why can we hit the bug?
Namazu achieves faster pattern growth.
That's why we can hit the bug.
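
One way to read "pattern growth" is the number of distinct matrices P observed as test runs accumulate. That reading is an assumption; the slide does not spell out the metric. A minimal sketch with made-up coverage data:

```go
package main

import (
	"fmt"
	"strings"
)

// Pattern is the L x N matrix P: Pattern[i][j] is true if node j covered
// the branch in line i during one test run.
type Pattern [][]bool

// key flattens a pattern into a string so it can be used as a map key.
func (p Pattern) key() string {
	var b strings.Builder
	for _, row := range p {
		for _, covered := range row {
			if covered {
				b.WriteByte('1')
			} else {
				b.WriteByte('0')
			}
		}
		b.WriteByte('|')
	}
	return b.String()
}

// growth returns the cumulative number of distinct patterns after each run.
func growth(runs []Pattern) []int {
	seen := map[string]bool{}
	counts := make([]int, 0, len(runs))
	for _, p := range runs {
		seen[p.key()] = true
		counts = append(counts, len(seen))
	}
	return counts
}

func main() {
	// Two lines, three nodes; the values are illustrative only.
	a := Pattern{{true, false, false}, {true, true, false}}
	b := Pattern{{true, true, false}, {true, true, false}}
	fmt.Println(growth([]Pattern{a, a, b, a})) // [1 1 2 2]
}
```

Under this reading, a steeper curve means the runs explore more distinct interleavings, which raises the chance of hitting a rare one.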

HOW TO USE NAMAZU?

Easy to install
Easy to get started
• Provides Docker-like CLI
• No code instrumentation needed
• No configuration needed (default: just renice threads)
How to use Namazu?
$ sudo apt-get install lib{netfilter-queue,zmq3}-dev
$ go get github.com/osrg/namazu/nmz
$ sudo nmz container run -it -v /foo:/foo ubuntu
[container]$ cd /foo && mvn test

For threads ("renicing")
$ sudo nmz inspectors proc -pid $TARGET_PID
For filesystem
$ sudo nmz inspectors fs -mount-point /nmzfs
For network packets
$ sudo iptables ... -j NFQUEUE --queue-num 42
$ sudo nmz inspectors ethernet -nfq-number 42
Need distributed mode? (for integration testing)
Just add `--orchestrator-url http://foobar:10080/api/v3` to the CLI.
How to use Namazu?

Namazu API (Go)
type ExplorePolicy interface {
    // An event can contain Ethernet packet bytes
    QueueEvent(Event)
    ActionChan() chan Action
}

func (p *MyPolicy) QueueEvent(event Event) {
    // You can also inject fault actions here
    action := event.DefaultAction()
    // The action is fired at a random time within [10ms, 30ms]
    p.timeBoundedQ.Enqueue(action,
        10 * Millisecond, 30 * Millisecond)
}

func (p *MyPolicy) ActionChan() chan Action {
    return p.timeBoundedQ.DequeueChan
}

Namazu defines a REST API,
so you can also use other languages

• We found a bug: YARN cannot detect disk failures
where mkdir()/rmdir() blocks
• While reading the code, we noticed that the bug could occur in theory,
and then actually triggered it using Namazu
• We knew in advance when to inject the fault,
so we wrote a concrete scenario by hand using the Namazu API
• Much more realistic than JUnit + mocking
API use case: found YARN-4301
[Diagram: left, a case where mkdir() returns EIO explicitly; right, a case where mkdir() blocks]

func (p *MyPolicy) signalHandler() {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGUSR1)
    for {
        <-sigChan
        // fault: blocks for 10 minutes
        p.sleep = 10 * time.Minute
    }
}

// Started once while the policy is being set up
go p.signalHandler()

func (p *MyPolicy) QueueEvent(event Event) {..}
func (p *MyPolicy) ActionChan() chan Action {..}

$ go run mypolicy.go inspectors fs -mount-point /nmzfs

Set "yarn.nodemanager.local-dirs" to "/nmzfs/nm-local-dir", then
send SIGUSR1 to Namazu when you (and YARN) are ready
Interactive testing is often easier than writing a JUnit test case
We use SIGUSR1 here,
but it would also be interesting to
implement a human-friendly
CLI or GUI for
interactive testing

• If you know the protocol, you can compute
a hash for each packet
• Note that you have to exclude time-dependent and random
bytes when you hash the packet
• Using the hash and the Namazu API, you can "semi"-deterministically
replay the scenario
• Not fully deterministic; it is only best effort
• Record-less! You just need to remember the "seed" for
replaying
• PoC: ZOOKEEPER-2212: up to 65% reproducibility
• More information: osrg/namazu#137
• See also (for Go): https://github.com/AkihiroSuda/go-replay
Another API use case: "semi"-deterministic replay

SIMILAR GREAT TOOLS

• Network partitioner + Linearizability tester
• Famous for the "Call Me Maybe" blog series: http://jepsen.io/
• “Call Me Maybe” by Carly Rae Jepsen (vevo):
https://www.youtube.com/watch?v=fWNaR-rxAic
• Randomly injects network partition using iptables
• "Linearizability" ∈ "Strong consistency"
• Integration test on a flaky network rather than a
flaky xUnit test
Similar great tool: Jepsen

• Has been used to test several Apache projects
• Cassandra: 9851,10001,10068,10231,10413,10674
• http://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
• HBase
• Kafka
• Solr: 6530, 6583, 6610
• http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks
• ZooKeeper
Similar great tool: Jepsen

• Namazu is much more general
• The bugs we found/reproduced are basically beyond the
scope of Jepsen (threads, disks..)
• Namazu can also be combined with Jepsen! That is
our next step..
Jepsen: causes network partitions, tests linearizability
Namazu: increases non-determinism, injects filesystem faults
...
Namazu + Jepsen?

• Make the filesystem flaky using FUSE
• Used in testing ScyllaDB (Apache Cassandra's clone)
• https://github.com/scylladb/charybdefs
• Similar to Namazu FS
• Both provide an API
• Also similar to PetardFS (not active since 2007)
• CharybdeFS can also be combined with Namazu
• CharybdeFS is specialized in FS; Namazu is much more
comprehensive.
Similar great tool: CharybdeFS

https://github.com/NetSys/demi
• Found some akka-raft bugs and reproduced a few Spark bugs
• Challenge: reducing false positives related to instrumentation
• DEMi and Namazu are complementary to each other
• DEMi is powerful, but has some limitations
• Namazu is comprehensive and easy to get started with
Similar great tool: DEMi (appeared in NSDI'16)
                         Namazu                                      DEMi
Target                   Generic (Network, Filesystem, Thread..)     Akka
Getting started          Easy                                        Need to write AspectJ code
Deterministic replay?    No                                          Yes
Bug cause minimization?  No                                          Yes

SO... HOW CAN WE FIX FLAKY TESTS?

• Namazu finds/reproduces flaky tests, but it
doesn't automatically fix them 😞
• Basic approach for async-related flakiness:
Adjust the values for sleep() and retries in the
test code
How can we fix flaky tests?

How can we fix flaky tests?
• Suggestion: the timeout (and retries) should be a configurable
parameter rather than a hard-coded value
Timeout value  Cost (time)  Risk (timeout)  Appropriate for
Long           High         Low             Slow machines (e.g. CI), conservative people
Short          Low          High            Fast machines, risk-tolerant people

CONCLUSION

• Apache software is well tested
• But its tests are flaky
• Let's improve them
• Improve asynchronous code
• Repeat tests
• Our tool can control non-determinism
so as to reproduce flaky tests
Conclusion
