Copyright© 2016 NTT Corp. All Rights Reserved.
Flaky Tests and Bugs in
Apache Software (e.g. Hadoop)
Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
NTT Software Innovation Center
ApacheCon Core North America (May 12, 2016, at Vancouver)

Who am I
• Software Engineer at NTT Corporation
  • NTT: the largest telecom in Japan
• Engaged in improving the reliability of distributed systems
• Some contributions to ZooKeeper / Hadoop, including critical bug fixes (non-committer)
• github: https://github.com/AkihiroSuda

Agenda
• Current "flakiness" in Apache software
• Why do flaky tests matter?
• What causes a flaky test?
• How can we find, reproduce, and fix a flaky test?
• Existing work in Apache communities
• Our work: Namazu (鯰, catfish)
  https://github.com/osrg/namazu

Good News: Apache software is well tested!
Software    Production code (LOC)   Test code (LOC)
MapReduce   95K                     87K
YARN        178K                    121K
HDFS        152K                    150K
ZooKeeper   33K                     27K
HBase       571K                    222K
Spark       167K                    128K
Flume       46K                     34K
Cassandra   168K                    78K
Data were measured on 14/01/2016, using CLOC

Bad News: https://builds.apache.org/job/%s-trunk/
[Screenshots: Jenkins build-time trend charts for MapReduce, YARN, HDFS, ZooKeeper, and HBase; blue = success, red = failure]
Data were captured on 14/01/2016
I've never seen a fully successful Hadoop build, even on my local machine...

Bad News: JIRA QL: project = ? AND text ~ "test fail*"
Software    #Matched issues   #All issues
MapReduce   2,441 (38%)       6,373
YARN        2,290 (63%)       4,756
HDFS        5,141 (53%)       9,672
ZooKeeper   828 (35%)         2,384
HBase       6,595 (42%)       15,542
Spark       794 (6%)          14,047
Flume       342 (12%)         2,882
Cassandra   1,656 (15%)       11,430
Data were captured on 4/4/2016 (just an approximation)
Roughly speaking, half of Hadoop development is dedicated to debugging test failures.
Interestingly, the flakiness does not seem to be uniform across projects.. (discussed later)

Not all test failures are critical for production..
97% of unit test failures in Apache software are said to be harmless for production ("false alarms")
• Information source:
  "An Empirical Study of Bugs in Test Code" (A. Vahabzadeh et al., ICSME'15)

So flaky tests don't matter, as they don't affect production?
It still matters!
For developers..
It's a barrier to promoting CI
• If many tests are flaky, developers tend to ignore CI failures → real bugs get overlooked
It's also a psychological barrier to contribution
• A developer may be blamed for a test failure
For users..
It's a barrier to risk assessment for production
• No one can tell flaky tests from real bugs

So flaky tests don't matter, as they don't affect production?
SemaphoreCI suggests a "No broken windows" strategy for flaky tests
https://semaphoreci.com/community/tutorials/how-to-deal-with-and-eliminate-flaky-tests
image: http://guides.lib.jjay.cuny.edu/nypd/brokenwindows

Basic cause: async operation
• A typical flaky test is caused by a malformed async operation like the one below
  (A. Vahabzadeh et al., ICSME'15 / Q. Luo et al., ACM FSE'14 / YARN-4478)
• Basically it can be fixed by increasing the timeout and retries
• But it's not easy to find a reasonable timeout value
  (e.g. YARN-{4804, 4807, 4929...})
• A long timeout is expensive

    invokeAsyncOperation();
    // some tests lack even this sleep
    sleep(certainHardcodedTimeout);
    assertTrue(checkSomethingGoodHasHappened());
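
One common way to soften the fixed-sleep pattern above is to poll the condition until a deadline instead of sleeping once. The following is only a minimal Go sketch of that idea; `waitFor` and the goroutine standing in for `invokeAsyncOperation()` are hypothetical names, not taken from Hadoop or any test framework.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// waitFor polls cond every interval until it returns true or the timeout elapses.
// A fast machine returns early; a slow machine may use the whole timeout.
func waitFor(cond func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("condition not met before timeout")
		}
		time.Sleep(interval)
	}
}

func main() {
	var done atomic.Bool
	// Hypothetical stand-in for invokeAsyncOperation(): completes after 20ms.
	go func() {
		time.Sleep(20 * time.Millisecond)
		done.Store(true)
	}()
	// Stand-in for assertTrue(checkSomethingGoodHasHappened()).
	err := waitFor(done.Load, 2*time.Second, 5*time.Millisecond)
	fmt.Println("error:", err)
}
```

Polling keeps the cost of a generous timeout low on fast machines, but it does not remove the need to choose the timeout itself.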

Testbed (e.g. CI) can cause test failures as well
• Host configuration
• Host performance
• Docker is great! But it still has some issues

CI host configuration can cause test failures
• HADOOP-12687
  • Many YARN tests fail when /etc/hosts has multiple loopback entries
• ZOOKEEPER-2252
  • Test: nslookup("a") should fail
  • It does not fail when a host named "a" actually exists
• INFRA-11811
  • JDK was not set up properly on a Jenkins slave
  • Such a test fails only when the job is assigned to a specific buildbot, so it looks like a flaky test

CI host performance: they're not made equal
• Hadoop's buildbot: https://builds.apache.org/computer/ (data captured on 25/04/2016)
• Spark's buildbot: https://amplab.cs.berkeley.edu/jenkins/computer/
• Significant difference in the response time!
Target   Average   Max      Min
Hadoop   1163ms    1482ms   30ms
Spark    3ms       6ms      0ms
• Maybe related to the fact that Spark has only a small number of test-related issues
  (e.g. YARN 63% vs Spark 6% (slide 7))

Docker issues
Docker is great for testing!
• Some Apache projects use Docker on their CI (via Apache Yetus)
• Apache BigTop also utilizes Docker for provisioning Hadoop
• People also love Docker for setting up test beds on their workstations and laptops
  • Of course, me too

Docker #18180: Java VM unkillable zombie
• Mentioned in several Apache-related issue tickets:
  • jupyter/docker-stacks#75: Spark hanging
  • docker-library/cassandra#43, #46
  • docker-solr/docker-solr#4
  • ALLURA-8039
  • AMBARI-14706
  • IGNITE-2377
  • YETUS-229 …
• Fortunately the Apache buildbot (Yetus) didn't hit the bug, but the bug made people's local testbeds flaky in a weird way.
• Fixed in recent kernels (so, strictly speaking, it's not a Docker issue)

Other potential Docker-related issues
AUFS: fcntl(F_SETFL, O_APPEND) was not supported (#20199)
• Can cause data corruption (Dovecot is known to be affected)
• Fixed in recent AUFS
Overlay: You should not open a file with O_RDWR and O_RDONLY simultaneously (#10180)
• Can cause data corruption (RPM is known to be affected)
• Expected behavior, won't get fixed
More information: https://github.com/AkihiroSuda/docker-issues

Flaky tests are not limited to xUnit in CI..
• Some issues can occur only in a deployed environment rather than in CI
  • e.g. TCP packet corruption
  • Very flaky and critical

TCP packet corruption
https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/
• TCP checksum was ignored in some IPsec configurations
• ZooKeeper intermittently behaved strangely due to corrupted TCP packets
https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.gq8chzply
• TCP checksum was ignored in some veth configurations
• Mesos and Kubernetes are affected

TCP packet corruption
• It's very hard to notice (and reproduce) flaky TCP packet corruption...
• Should distributed systems be tolerant of TCP corruption...?
  • The probability is very low in regular environments, but it is not zero
    (32-bit Ethernet CRC + 16-bit TCP checksum)
• JIRA issues: ZOOKEEPER-2175, HDFS-8161…
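
To illustrate why a 16-bit checksum alone leaves a non-zero chance of undetected corruption, here is a small Go sketch of the one's-complement Internet checksum (in the style of RFC 1071). It is a simplification: it sums only the bytes it is given, whereas the real TCP checksum also covers the TCP header and a pseudo-header. Because the sum is commutative, swapping two 16-bit words produces the same checksum.

```go
package main

import "fmt"

// internetChecksum computes the 16-bit one's-complement checksum over data.
// An odd trailing byte is padded with a zero byte on the right.
func internetChecksum(data []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(data[i])<<8 | uint32(data[i+1])
	}
	if len(data)%2 == 1 {
		sum += uint32(data[len(data)-1]) << 8
	}
	// Fold the carries back into the low 16 bits.
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

func main() {
	original := []byte{0x12, 0x34, 0xab, 0xcd}
	swapped := []byte{0xab, 0xcd, 0x12, 0x34} // the two 16-bit words are exchanged
	fmt.Printf("original: %#04x\n", internetChecksum(original))
	fmt.Printf("swapped:  %#04x\n", internetChecksum(swapped))
}
```

When the link-layer CRC is also bypassed or recomputed after the damage, as in the IPsec and veth cases above, nothing below the application is left to catch such a change.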

Efforts to find/reproduce a flaky test
• determine-flaky-tests-hadoop.py
• Apache Kudu's CI (dist_test)
• Google's TAP
• Our work: Namazu
  https://github.com/osrg/namazu
• and similar great tools

determine-flaky-tests-hadoop.py
• Picks up failed tests using the Jenkins API
• Included in hadoop.git/dev-support (HADOOP-11045)

$ determine-flaky-tests-hadoop.py --job Hadoop-YARN-trunk
****Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-YARN-trunk
...
Among 15 runs examined, all failed tests <#failedRuns: testName>:
7: TestContainerManagerRecovery.testApplicationRecovery
...

determine-flaky-tests-hadoop.py
• Great tool, but it doesn't support running a specific test repeatedly
• Also there is a Maven dependency issue (YARN-4478)
  • B depends on A
  • TestB is never executed if TestA fails
  → If TestA is flaky, we can't evaluate the flakiness of TestB!

Kudu's CI: flaky test dashboard
http://dist-test.cloudera.org:8080/ (Apr 25)
Recently open-sourced and introduced at Apache: Big Data (Monday)
https://github.com/cloudera/dist_test

Kudu's CI: flaky test dashboard
• Tests are run repeatedly on CI to find flaky tests
  • KUDU_FLAKY_TEST_ATTEMPTS
  • KUDU_FLAKY_TEST_LIST

(From https://github.com/apache/incubator-kudu/commit/1a24338a)
Fix flakiness of client_failover-itest
The reason this test was flaky is that there is a race between..
..
Looped 100x and they all passed:
http://dist-test.cloudera.org/job?job_id=mpercy.1454486819.10566
Author Mike Percy Jan 29, 2016 8:01 AM
Committer Todd Lipcon Feb 4, 2016 2:14 PM
Commit 1a24338ad60a8842d1ae5e227f8f03e58faea8c0

Google's TAP
• Google's internal CI
• 1.6M test failures per day
  • 73K (4.5%) are flaky
• Repeats a failing test 10 times to label flaky tests
• Information source: "An Empirical Analysis of Flaky Tests" (Q. Luo et al., ACM FSE'14)
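
The rerun-and-label idea can be sketched in a few lines of Go. This is only an illustration of the general technique, not TAP's actual implementation; the `Label` type, `classify`, and the test closure are hypothetical.

```go
package main

import "fmt"

// Label is the verdict assigned to a test after repeated runs.
type Label string

const (
	Passing Label = "passing"
	Failing Label = "failing"
	Flaky   Label = "flaky"
)

// classify runs test the given number of times and labels it flaky
// when the outcomes are mixed (at least one pass and one failure).
func classify(test func() bool, attempts int) Label {
	passes, failures := 0, 0
	for i := 0; i < attempts; i++ {
		if test() {
			passes++
		} else {
			failures++
		}
	}
	switch {
	case failures == 0:
		return Passing
	case passes == 0:
		return Failing
	default:
		return Flaky
	}
}

func main() {
	calls := 0
	// Hypothetical test that fails on every third invocation.
	sometimesFails := func() bool {
		calls++
		return calls%3 != 0
	}
	fmt.Println(classify(sometimesFails, 10))
}
```

A test whose failure probability is very low can still come out as "passing" after 10 runs, which is the limitation the next slide addresses.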

Challenge: poor non-determinism
• Modern CIs run jobs repeatedly to find / reproduce flaky tests
• But they don't control non-determinism
  → They overlook a flaky test
  → They cannot reproduce a failure
    → They cannot analyze the failure
• Our suggestion: increase non-determinism for finding and reproducing flaky tests
NAMAZU: PROGRAMMABLE FUZZY SCHEDULER
https://github.com/osrg/namazu
NOTE: Namazu was formerly named "Earthquake"

Namazu: programmable fuzzy scheduler
https://github.com/osrg/namazu
[Diagram: events from the filesystem, packets, Java, Linux threads, and Go (planned) are scheduled in a fuzzed (randomized) order]
Increases non-determinism for finding and reproducing flaky tests
鯰 (namazu) means a catfish in Japanese

Namazu: programmable fuzzy scheduler
https://github.com/osrg/namazu
Target          Technique
Filesystem      FUSE
Packet          Netfilter, Openflow
Java            Byteman, AspectJ
Go [planned]    AspectGo [wip] (https://github.com/AkihiroSuda/golang-exp-aspectgo)
Linux threads   sched_setattr(2)
Namazu uses non-invasive techniques
• can be easily applied to any environment
• can avoid false positives

Namazu targets
• xUnit tests
  • 😃 Easy to get started; just run `mvn`
  • 😃 Can reproduce test failures observed in CI
  • 😞 Limited testable scope
• Integration tests on a distributed cluster
  • 😃 Can test everything
  • 😞 Need to write a script to set up the cluster
    • But Docker helps us a lot!

Namazu targets
We support both scenarios
• Single-node mode (for xUnit tests): $ mvn test
• Distributed mode (for integration tests) [diagram labels: Orchestrator, RPC]
NAMAZU + XUNIT TESTS
$ mvn test

Namazu + xUnit tests
• Namazu is a comprehensive framework...
• Quick start: "renice" threads for xUnit tests
  • POSIX.1 requires that threads share a single nice (priority) value, but the actual Linux implementation (NPTL) does not.
  • Not always effective, but it's generic and easy to get started

Namazu + xUnit tests
$ PID=$(docker inspect $(docker ps -q -f ancestor=hadoop-build-ubuntu) | jq .[0].State.Pid)
$ sudo nmz inspectors proc -pid $PID
$ cd hadoop; ./start-build-env.sh
[container]$ mvn test -Dtest=TestFoo#testBar
Namazu periodically sets random nice values for all the child processes and the threads under $PID
In addition, it utilizes non-default kernel schedulers (e.g. SCHED_BATCH)
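
The per-thread "renicing" described above can be sketched in Go as follows. This is an illustrative sketch for Linux only, not Namazu's actual code: it performs a single pass over the threads of one process, does not walk child processes, and does not touch the scheduling policy. The helper names `newRNG`, `randomNice`, `threadIDs`, and `reniceThreads` are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"strconv"
	"syscall"
)

// newRNG returns a seeded generator so that a run can be repeated with the same seed.
func newRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// randomNice picks a nice value in [0, 19]; negative values would raise
// the priority and normally need root.
func randomNice(rng *rand.Rand) int { return rng.Intn(20) }

// threadIDs lists the thread IDs of a process by reading /proc/<pid>/task (Linux only).
func threadIDs(pid int) ([]int, error) {
	entries, err := os.ReadDir("/proc/" + strconv.Itoa(pid) + "/task")
	if err != nil {
		return nil, err
	}
	var tids []int
	for _, e := range entries {
		if tid, err := strconv.Atoi(e.Name()); err == nil {
			tids = append(tids, tid)
		}
	}
	return tids, nil
}

// reniceThreads assigns an independent random nice value to every thread of pid.
func reniceThreads(pid int, rng *rand.Rand) error {
	tids, err := threadIDs(pid)
	if err != nil {
		return err
	}
	for _, tid := range tids {
		// On Linux, PRIO_PROCESS with a thread ID changes only that thread.
		if err := syscall.Setpriority(syscall.PRIO_PROCESS, tid, randomNice(rng)); err != nil {
			return fmt.Errorf("thread %d: %w", tid, err)
		}
	}
	return nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: renice <pid>")
		return
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Println("invalid pid:", err)
		return
	}
	if err := reniceThreads(pid, newRNG(1)); err != nil {
		fmt.Println("renice failed:", err)
	}
}
```

Calling `reniceThreads` from a ticker loop would approximate the periodic behaviour; lowering a nice value again typically requires root, which is presumably why the commands above use sudo.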

Namazu + xUnit tests: Reproducibility
Issue            Testcase                       Traditional   Namazu
YARN-4548        RM/TestCapacityScheduler       11%           82%
YARN-4556        RM/TestFifoScheduler           2%            44%
ZOOKEEPER-2137   ReconfigTest                   2%            16%
YARN-4168        NM/TestLogAggregationService   1%            8%
YARN-1978        NM/TestLogAggregationService   0%            4%
YARN-4543        NM/TestNodeStatusUpdater       0%            1%
• More information: osrg/namazu#125

Namazu + xUnit tests: Reproducibility
Issue            Testcase               Traditional   Namazu
ZOOKEEPER-2080   ReconfigRecoveryTest   14.0%         61.9%
• "Renicing" is not always effective...
• But even when renicing is ineffective, you can sometimes reproduce the flaky test by injecting delays or reordering packets
$ sudo iptables ... -j NFQUEUE --queue-num 42
$ sudo nmz inspectors ethernet -nfq-number 42
NAMAZU + INTEGRATION TESTS

Namazu + Integration tests
• ZooKeeper: distributed coordination service
  • used in Hadoop, Spark, Mesos, Kafka..
• ZooKeeper 3.5 (alpha) introduced dynamic reconfiguration
• We performed an integration test to evaluate the reliability of the reconfiguration
• We found a flaky bug!

Namazu + Integration tests
• We permuted some specific Ethernet packets in random order using Namazu
  • TCP retransmissions are eliminated to reduce the possible state space
[Diagram: ZooKeeper cluster connected via Open vSwitch + Ryu SDN Framework + Namazu]

Found ZOOKEEPER-2212
• Bug: a new node cannot join the ZK cluster properly
  • The new node cannot become a leader of the ZK cluster
    (more technically, it keeps being an "observer")
• Cause: distributed race (ZAB packet vs FLE packet)
  • ZAB: atomic broadcast protocol for data
  • FLE: leader election protocol for the ZK cluster itself
[Diagram: the leader of the ZK cluster and the new ZK node exchange ZAB [2888/tcp] and FLE [3888/tcp] messages; they use different TCP connections, so the packet order is non-deterministic]

Found ZOOKEEPER-2212
Data were captured on 22/01/2016

Found ZOOKEEPER-2212
• Expected: the ZK cluster keeps working as long as a majority of the N nodes is alive (i.e. fewer than N/2 nodes have crashed)
• Real: a single node failure can terminate the 3-node ensemble
[Diagram: one node is not participating properly (keeps being an "observer")]

How hard is it to reproduce?
• Reproducibility: 0.0% → 21.8% (tested 1,000 times)
• We could not reproduce the bug even after 5,000 runs of traditional testing (60 hours!)
• It is also reproducible by "renicing" threads, but the reproducibility is just 0.7%

Why can we hit the bug?
We define the distributed execution pattern based on code coverage:
$$
P = \begin{pmatrix}
p_{1,1} & \cdots & p_{1,N} \\
\vdots & \ddots & \vdots \\
p_{L,1} & \cdots & p_{L,N}
\end{pmatrix}
$$
• $L$: LOC
• $N$: number of nodes (3 in this case)
• $p_{i,j}$: 1 if node $j$ covers the branch in line $i$, otherwise 0
• We used JaCoCo: Java Code Coverage Library (patch: ZOOKEEPER-2266)
Namazu achieves faster pattern growth.
That's why we can hit the bug.
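
"Pattern growth" can be read as the number of distinct pattern matrices observed as the number of runs increases. The Go sketch below counts that under this assumption; the `Pattern` type and `countDistinct` are hypothetical and are not part of Namazu or JaCoCo.

```go
package main

import (
	"fmt"
	"strings"
)

// Pattern is an L x N binary matrix: Pattern[i][j] is true when node j
// covered the branch in line i during one test run.
type Pattern [][]bool

// key serialises a pattern so that identical matrices map to the same string.
func (p Pattern) key() string {
	var b strings.Builder
	for _, row := range p {
		for _, covered := range row {
			if covered {
				b.WriteByte('1')
			} else {
				b.WriteByte('0')
			}
		}
		b.WriteByte('\n')
	}
	return b.String()
}

// countDistinct returns, after each run, the cumulative number of
// distinct patterns seen so far.
func countDistinct(runs []Pattern) []int {
	seen := map[string]bool{}
	growth := make([]int, 0, len(runs))
	for _, p := range runs {
		seen[p.key()] = true
		growth = append(growth, len(seen))
	}
	return growth
}

func main() {
	// Three runs on a toy system with 2 lines and 3 nodes; runs 1 and 3 coincide.
	a := Pattern{{true, false, false}, {true, true, false}}
	b := Pattern{{true, true, false}, {true, true, false}}
	fmt.Println(countDistinct([]Pattern{a, b, a}))
}
```

A scheduler that explores more interleavings should make this cumulative count rise faster per run than plain repetition does.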
HOW TO USE NAMAZU?

How to use Namazu?
Easy to install
$ sudo apt-get install lib{netfilter-queue,zmq3}-dev
$ go get github.com/osrg/namazu/nmz
Easy to get started
$ sudo nmz container run -it -v /foo:/foo ubuntu
[container]$ cd /foo && mvn test
• Provides a Docker-like CLI
• No code instrumentation needed
• No configuration needed (default: just renice threads)

How to use Namazu?
For threads ("renicing")
$ sudo nmz inspectors proc -pid $TARGET_PID
For filesystem
$ sudo nmz inspectors fs -mount-point /nmzfs
For network packets
$ sudo iptables ... -j NFQUEUE --queue-num 42
$ sudo nmz inspectors ethernet -nfq-number 42
Need distributed mode? (for integration testing)
Just add `--orchestrator-url http://foobar:10080/api/v3` to the CLI.

Namazu API (Go)
    type ExplorePolicy interface {
        QueueEvent(Event)
        ActionChan() chan Action
    }

    func (p *MyPolicy) QueueEvent(event Event) {
        // An event can contain Ethernet packet bytes.
        // You can also inject fault actions here.
        action := event.DefaultAction()
        // The action is fired at a random time in [10ms, 30ms].
        p.timeBoundedQ.Enqueue(action,
            10*time.Millisecond, 30*time.Millisecond)
    }

    func (p *MyPolicy) ActionChan() chan Action {
        return p.timeBoundedQ.DequeueChan
    }
Namazu defines a REST API, so you can also use other languages
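
To make the "fired at a random time in [10ms, 30ms]" behaviour concrete, here is a self-contained Go sketch of a queue that releases each action after a random delay within given bounds. It is an assumption-laden illustration, not Namazu's `timeBoundedQ`: the `Action` type, `timeBoundedQueue`, `newQueue`, and `randomDelay` are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// Action is a hypothetical stand-in for a scheduler action.
type Action string

// timeBoundedQueue releases each enqueued action after a random delay.
type timeBoundedQueue struct {
	mu  sync.Mutex // guards rng, which is not safe for concurrent use
	rng *rand.Rand
	out chan Action
}

func newQueue(seed int64) *timeBoundedQueue {
	return &timeBoundedQueue{
		rng: rand.New(rand.NewSource(seed)),
		out: make(chan Action, 16),
	}
}

// randomDelay draws a duration uniformly from [minDelay, maxDelay];
// it assumes minDelay <= maxDelay.
func randomDelay(rng *rand.Rand, minDelay, maxDelay time.Duration) time.Duration {
	return minDelay + time.Duration(rng.Int63n(int64(maxDelay-minDelay)+1))
}

// Enqueue makes the action available on q.out after a random delay in
// [minDelay, maxDelay], so the release order can differ from the enqueue order.
func (q *timeBoundedQueue) Enqueue(a Action, minDelay, maxDelay time.Duration) {
	q.mu.Lock()
	d := randomDelay(q.rng, minDelay, maxDelay)
	q.mu.Unlock()
	time.AfterFunc(d, func() { q.out <- a })
}

func main() {
	q := newQueue(1)
	for _, a := range []Action{"packet-1", "packet-2", "packet-3"} {
		q.Enqueue(a, 10*time.Millisecond, 30*time.Millisecond)
	}
	for i := 0; i < 3; i++ {
		fmt.Println(<-q.out) // arrival order may differ from enqueue order
	}
}
```

The upper bound matters as much as the randomness: it keeps every event from being delayed indefinitely, so the system under test still makes progress.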

API use case: found YARN-4301
• We found a bug: YARN cannot detect disk failure cases where mkdir()/rmdir() blocks
• While reading the code, we noticed that the bug could theoretically occur, and then actually reproduced it using Namazu
• When to inject the fault was known in advance, so we manually wrote a concrete scenario using the Namazu API
• Much more realistic than JUnit + mocking
[Diagram: a case where mkdir() returns EIO explicitly vs. a case where mkdir() blocks]

API use case: found YARN-4301
    func (p *MyPolicy) signalHandler() {
        signal.Notify(sigChan, syscall.SIGUSR1)
        for {
            <-sigChan
            p.sleep = 10 * time.Minute // fault: blocks for 10 minutes
        }
    }
    go p.signalHandler()
    func (p *MyPolicy) QueueEvent(event Event) {..}
    func (p *MyPolicy) ActionChan() chan Action {..}

$ go run mypolicy.go inspectors fs -mount-point /nmzfs
Set "yarn.nodemanager.local-dirs" to "/nmzfs/nm-local-dir", then send SIGUSR1 to Namazu when you (and YARN) are ready
Interactive testing is often easier than writing a JUnit testcase
We use SIGUSR1 here, but it would also be interesting to implement a human-friendly CLI or GUI for interactive testing

Another API use case: "semi"-deterministic replay
• If you have knowledge of the protocol, you can compute a hash for a packet
  • Note that you have to exclude time-dependent and random bytes when you hash the packet
• Using the hash and the Namazu API, you can "semi"-deterministically replay the scenario
  • Not fully deterministic; it is just best-effort
  • Record-less! You just need to remember the "seed" for replaying
• PoC: ZOOKEEPER-2212: up to 65% reproducibility
• More information: osrg/namazu#137
• See also (for Go): https://github.com/AkihiroSuda/go-replay
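
The hash-plus-seed idea can be sketched as below: the delay applied to a packet is derived only from the seed and the packet's stable bytes, so the same logical packet gets the same delay in every run. This is a rough Go sketch, not the code from osrg/namazu#137; `delayFor`, `stableBytes`, and the packet layout (a 4-byte timestamp prefix) are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"time"
)

// stableBytes drops the fields that change between runs. The layout is
// hypothetical: the first 4 bytes are treated as a timestamp.
func stableBytes(packet []byte) []byte {
	if len(packet) < 4 {
		return nil
	}
	return packet[4:]
}

// delayFor derives a delay in [0, max) from the seed and the stable bytes,
// so no recording of a previous run is needed, only the seed.
func delayFor(seed uint64, stable []byte, max time.Duration) time.Duration {
	if max <= 0 {
		return 0
	}
	h := fnv.New64a()
	var s [8]byte
	binary.BigEndian.PutUint64(s[:], seed)
	h.Write(s[:])
	h.Write(stable)
	return time.Duration(h.Sum64() % uint64(max))
}

func main() {
	// The same logical packet observed in two runs; only the timestamp differs.
	run1 := []byte{0x00, 0x00, 0x00, 0x01, 'v', 'o', 't', 'e'}
	run2 := []byte{0x00, 0x00, 0x00, 0x02, 'v', 'o', 't', 'e'}
	seed := uint64(2212)
	fmt.Println(delayFor(seed, stableBytes(run1), 30*time.Millisecond))
	fmt.Println(delayFor(seed, stableBytes(run2), 30*time.Millisecond))
}
```

Replay stays best-effort because thread scheduling and the network outside the injected delays are still not controlled.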
SIMILAR GREAT TOOLS

Similar great tool: Jepsen
• Network partitioner + linearizability tester
• Famous for the "Call Me Maybe" blog: http://jepsen.io/
  • "Call Me Maybe" by Carly Rae Jepsen (vevo): https://www.youtube.com/watch?v=fWNaR-rxAic
• Randomly injects network partitions using iptables
• "Linearizability" ∈ "Strong consistency"
• Integration test on a flaky network rather than a flaky xUnit test

Similar great tool: Jepsen
• Has been used to test several Apache projects
  • Cassandra: 9851, 10001, 10068, 10231, 10413, 10674
    • http://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
  • HBase
  • Kafka
  • Solr: 6530, 6583, 6610
    • http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks
  • ZooKeeper

Namazu + Jepsen?
• Namazu is much more generalized
  • The bugs we found/reproduced are basically beyond the scope of Jepsen (threads, disks..)
• Namazu can also be combined with Jepsen! That will be our next work..
Jepsen
• causes network partitions
• tests linearizability
Namazu
• increases non-determinism
• injects filesystem faults
• ...

Similar great tool: CharybdeFS
• Makes the filesystem flaky using FUSE
• Used in testing ScyllaDB (an Apache Cassandra clone)
• https://github.com/scylladb/charybdefs
• Similar to Namazu FS
  • Both support an API
• Also similar to PetardFS (not active since 2007)
• CharybdeFS can also be combined with Namazu
  • CharybdeFS is specialized in FS; Namazu is much more comprehensive.

Similar great tool: DEMi (appeared in NSDI'16)
https://github.com/NetSys/demi
• Found some akka-raft bugs and reproduced a few Spark bugs
  • Challenge in reducing false positives related to instrumentation
• DEMi and Namazu are complementary to each other
  • DEMi is powerful, but has some limitations
  • Namazu is comprehensive and easy to get started with
                          Namazu                                     DEMi
Target                    Generic (Network, Filesystem, Thread..)    Akka
Getting Started           Easy                                       Need to write AspectJ code
Deterministic Replay?     No                                         Yes
Bug Cause Minimization?   No                                         Yes
SO... HOW CAN WE FIX FLAKY TESTS?

How can we fix flaky tests?
• Namazu finds/reproduces flaky tests, but it doesn't automatically fix them 😞
• Basic approach for async-related flakiness: adjust the values for sleep() and retries in the test code

    invokeAsyncOperation();
    // some tests lack even this sleep
    sleep(certainHardcodedTimeout);
    assertTrue(checkSomethingGoodHasHappened());

How can we fix flaky tests?

    invokeAsyncOperation();
    // some tests lack even this sleep
    sleep(certainHardcodedTimeout);
    assertTrue(checkSomethingGoodHasHappened());

• Suggestion: the timeout (and retries) should be a configurable parameter rather than a hard-coded value
Timeout value   Cost (time)   Risk (timeout)   Appropriate for
Long            High          Low              Slow machine (e.g. CI); conservative person
Short           Low           High             Fast machine; risk-tolerant person
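
A configurable timeout can be as simple as a scaling factor read from the environment. The Go sketch below shows one possible shape; the variable name `TEST_TIMEOUT_FACTOR` and the helper `scaledTimeout` are hypothetical, not an existing Hadoop or JUnit setting.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"strconv"
	"time"
)

// scaledTimeout multiplies base by the factor found under the
// (hypothetical) key TEST_TIMEOUT_FACTOR. Missing or invalid values
// leave the base timeout unchanged.
func scaledTimeout(base time.Duration, lookup func(string) (string, bool)) time.Duration {
	raw, ok := lookup("TEST_TIMEOUT_FACTOR")
	if !ok {
		return base
	}
	factor, err := strconv.ParseFloat(raw, 64)
	// !(factor > 0) also rejects NaN.
	if err != nil || !(factor > 0) || math.IsInf(factor, 0) {
		return base
	}
	return time.Duration(float64(base) * factor)
}

func main() {
	// A slow CI host could export TEST_TIMEOUT_FACTOR=4; a fast laptop could leave it unset.
	fmt.Println(scaledTimeout(5*time.Second, os.LookupEnv))
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the helper easy to exercise without touching the real environment.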
CONCLUSION

Conclusion
• Apache software is well tested
  • But the tests are flaky
• Let's improve them
  • Improve asynchronous code
  • Repeat tests
• Our tool can control non-determinism so as to reproduce flaky tests
  https://github.com/osrg/namazu

More Related Content

What's hot

Startup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionStartup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image Distribution
Kohei Tokunaga
 
Monitoring system for OpenStack,using a OSS products
Monitoring system for OpenStack,using a OSS productsMonitoring system for OpenStack,using a OSS products
Monitoring system for OpenStack,using a OSS products
satsuki fukazu
 
DockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐるDockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐる
Kohei Tokunaga
 
Java applications containerized and deployed
Java applications containerized and deployedJava applications containerized and deployed
Java applications containerized and deployed
Anthony Dahanne
 
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver MeetupDaneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver MeetupShannon McFarland
 
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterThe overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
Kohei Tokunaga
 
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
VirtualTech Japan Inc.
 
Summit 16: The Open Source NFV Eco-system and OPNFV's Role Therein
Summit 16: The Open Source NFV Eco-system and OPNFV's Role ThereinSummit 16: The Open Source NFV Eco-system and OPNFV's Role Therein
Summit 16: The Open Source NFV Eco-system and OPNFV's Role Therein
OPNFV
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
Kohei Tokunaga
 
eStargzイメージとlazy pullingによる高速なコンテナ起動
eStargzイメージとlazy pullingによる高速なコンテナ起動eStargzイメージとlazy pullingによる高速なコンテナ起動
eStargzイメージとlazy pullingによる高速なコンテナ起動
Kohei Tokunaga
 
BuildKitでLazy Pullを有効にしてビルドを早くする話
BuildKitでLazy Pullを有効にしてビルドを早くする話BuildKitでLazy Pullを有効にしてビルドを早くする話
BuildKitでLazy Pullを有効にしてビルドを早くする話
Kohei Tokunaga
 
SCALE 2011 Deploying OpenStack with Chef
SCALE 2011 Deploying OpenStack with ChefSCALE 2011 Deploying OpenStack with Chef
SCALE 2011 Deploying OpenStack with Chef
Matt Ray
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioning
Evans Ye
 
Automated Deployment & Benchmarking with Chef, Cobbler and Rally for OpenStack
Automated Deployment & Benchmarking with Chef, Cobbler and Rally for OpenStackAutomated Deployment & Benchmarking with Chef, Cobbler and Rally for OpenStack
Automated Deployment & Benchmarking with Chef, Cobbler and Rally for OpenStack
NTT Communications Technology Development
 
"OpenCV for Embedded: Lessons Learned," a Presentation from itseez
"OpenCV for Embedded: Lessons Learned," a Presentation from itseez"OpenCV for Embedded: Lessons Learned," a Presentation from itseez
"OpenCV for Embedded: Lessons Learned," a Presentation from itseez
Edge AI and Vision Alliance
 
DevOps - Interview Question.pdf
DevOps - Interview Question.pdfDevOps - Interview Question.pdf
DevOps - Interview Question.pdf
MinhTrnNht7
 
Introduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into Containerd
Kohei Tokunaga
 
OpenStack Swiftの最新機能とStorlets
OpenStack Swiftの最新機能とStorletsOpenStack Swiftの最新機能とStorlets
OpenStack Swiftの最新機能とStorlets
Kota Tsuyuzaki
 
Smart Testing: Catching More Bugs with Less Code Through Topology Shuffler
Smart Testing: Catching More Bugs with Less Code Through Topology ShufflerSmart Testing: Catching More Bugs with Less Code Through Topology Shuffler
Smart Testing: Catching More Bugs with Less Code Through Topology Shuffler
OPNFV
 
PyParis2018 - Python tooling for continuous deployment
PyParis2018 - Python tooling for continuous deploymentPyParis2018 - Python tooling for continuous deployment
PyParis2018 - Python tooling for continuous deployment
Arthur Lutz
 

What's hot (20)

Startup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionStartup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image Distribution
 
Monitoring system for OpenStack,using a OSS products
Monitoring system for OpenStack,using a OSS productsMonitoring system for OpenStack,using a OSS products
Monitoring system for OpenStack,using a OSS products
 
DockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐるDockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐる
 
Java applications containerized and deployed
Java applications containerized and deployedJava applications containerized and deployed
Java applications containerized and deployed
 
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver MeetupDaneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
Daneyon Hansen - Intro to OpenStack - Feb13 OpenStack Denver Meetup
 
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterThe overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
 
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
HPNFVの取組みとMWC2015 – OpenStack最新情報セミナー 2015年4月
 
Summit 16: The Open Source NFV Eco-system and OPNFV's Role Therein
Summit 16: The Open Source NFV Eco-system and OPNFV's Role ThereinSummit 16: The Open Source NFV Eco-system and OPNFV's Role Therein
Summit 16: The Open Source NFV Eco-system and OPNFV's Role Therein
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
 
eStargzイメージとlazy pullingによる高速なコンテナ起動
eStargzイメージとlazy pullingによる高速なコンテナ起動eStargzイメージとlazy pullingによる高速なコンテナ起動
eStargzイメージとlazy pullingによる高速なコンテナ起動
 
BuildKitでLazy Pullを有効にしてビルドを早くする話
Flaky tests and bugs in Apache software (e.g. Hadoop)

  • 1. Flaky Tests and Bugs in Apache Software (e.g. Hadoop)
      Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
      NTT Software Innovation Center
      ApacheCon Core North America (May 12, 2016, Vancouver)
      Copyright© 2016 NTT Corp. All Rights Reserved.
  • 2. Who am I
      • Software Engineer at NTT Corporation
      • NTT: the largest telecom in Japan
      • Working on improving the reliability of distributed systems
      • Some contributions to ZooKeeper / Hadoop, including critical bug fixes (non-committer)
      • github: https://github.com/AkihiroSuda
  • 3. Agenda
      • Current "flakiness" in Apache software
      • Why do flaky tests matter?
      • What causes a flaky test?
      • How can we find, reproduce, and fix a flaky test?
      • Existing work at Apache communities
      • Our work: Namazu (鯰, catfish) https://github.com/osrg/namazu
  • 4. Agenda
      • Current "flakiness" in Apache software
      • Why do flaky tests matter?
      • What causes a flaky test?
      • How can we find, reproduce, and fix a flaky test?
      • Existing work at Apache communities
      • Our work: Namazu (鯰, catfish) https://github.com/osrg/namazu
  • 5. Good news: Apache software is well tested!
      Software     Production code (LOC)   Test code (LOC)
      MapReduce     95K                     87K
      YARN         178K                    121K
      HDFS         152K                    150K
      ZooKeeper     33K                     27K
      HBase        571K                    222K
      Spark        167K                    128K
      Flume         46K                     34K
      Cassandra    168K                     78K
      Data measured on 14/01/2016 using CLOC.
  • 6. Bad news: https://builds.apache.org/job/%s-trunk/
      [Figure: build history charts for the MapReduce, YARN, HDFS, ZooKeeper, and HBase trunk jobs. Axes: Build, Build Time. Blue = success, red = failure. Data captured on 14/01/2016.]
      I've never seen a fully successful Hadoop build, even on my local machine...
  • 7. Bad news: JIRA QL: project = ? AND text ~ "test fail*" (just an approximation)
      Software     #Matched        #All issues
      MapReduce    2,441 (38%)      6,373
      YARN         2,290 (63%)      4,756
      HDFS         5,141 (53%)      9,672
      ZooKeeper      828 (35%)      2,384
      HBase        6,595 (42%)     15,542
      Spark          794 ( 6%)     14,047
      Flume          342 (12%)      2,882
      Cassandra    1,656 (15%)     11,430
      Data captured on 4/4/2016.
      Roughly speaking, about half of Hadoop development is dedicated to debugging test failures.
      Interestingly, flakiness does not seem to be uniform across projects (discussed later).
  • 8. Agenda
      • Current "flakiness" in Apache software
      • Why do flaky tests matter?
      • What causes a flaky test?
      • How can we find, reproduce, and fix a flaky test?
      • Existing work at Apache communities
      • Our work: Namazu (鯰, catfish) https://github.com/osrg/namazu
  • 9. Not all test failures are critical for production..
      97% of unit test failures in Apache software are said to be harmless for production ("false alarms")
      • Information source: "An Empirical Study of Bugs in Test Code" (A.Vahabzadeh et al., ICSME'15)
  • 10. So flaky tests don't matter, as they don't affect production?
      It still matters!
      For developers..
      • It's a barrier to the adoption of CI
          • If many tests are flaky, developers tend to ignore CI failures and overlook real bugs
      • It's also a psychological barrier to contribution
          • A developer may be blamed for a test failure
      For users..
      • It's a barrier to risk assessment for production
          • No one can tell flaky tests from real bugs
  • 11. So flaky tests don't matter, as they don't affect production?
      SemaphoreCI suggests a "No broken windows" strategy for flaky tests
      https://semaphoreci.com/community/tutorials/how-to-deal-with-and-eliminate-flaky-tests
      image: http://guides.lib.jjay.cuny.edu/nypd/brokenwindows
  • 12. Agenda
      • Current "flakiness" in Apache software
      • Why do flaky tests matter?
      • What causes a flaky test?
      • How can we find, reproduce, and fix a flaky test?
      • Existing work at Apache communities
      • Our work: Namazu (鯰, catfish) https://github.com/osrg/namazu
  • 13. Basic cause: async operation
      • A typical flaky test is caused by a malformed async operation like this
        (A.Vahabzadeh et al., ICSME'15 / Q.Luo et al., ACM FSE'14 / YARN-4478)

          invokeAsyncOperation();
          // some tests lack even this sleep
          sleep(certainHardcodedTimeout);
          assertTrue(checkSomethingGoodHasHappened());

      • Basically, it can be fixed by increasing the timeout and the number of retries
      • But it's not easy to find a reasonable timeout value (e.g. YARN-{4804, 4807, 4929...})
      • A long timeout is expensive
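One common way to soften the trade-off between a short (flaky) and a long (expensive) hardcoded sleep is to poll the condition until a deadline, so the full timeout is only paid when the test is actually failing. The following is a minimal sketch of that idea, not code from the talk or from any Apache project: `PollingWait`, `waitFor`, and the simulated async operation are hypothetical names introduced for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class PollingWait {
    /**
     * Polls the check every intervalMillis until it returns true or
     * timeoutMillis has elapsed. Returns true if the condition was met.
     * (Hypothetical helper, shown only to illustrate the technique.)
     */
    static boolean waitFor(BooleanSupplier check, long intervalMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) {
                return true;                 // success: return as soon as possible
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;                // deadline passed: give up
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean done = new AtomicBoolean(false);
        // Simulated async operation (an assumption): sets the flag after about 50 ms.
        CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.set(true);
        });
        // Generous 5 s timeout, but the wait ends shortly after the flag is set.
        boolean ok = waitFor(done::get, 10, 5_000);
        System.out.println("condition met: " + ok);
    }
}
```

With this shape, a generous timeout mostly costs time on failing runs, while passing runs finish within roughly one polling interval of the condition becoming true. It does not remove the need to pick a timeout; it only makes a conservative choice cheaper.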
  • 14. Testbed (e.g. CI) can cause test failures as well
      • Host configuration
      • Host performance
      • Docker is great! But it still has some issues
  • 15. CI host configuration can cause test failures
      • HADOOP-12687
          • Many YARN tests fail when /etc/hosts has multiple loopback entries
      • ZOOKEEPER-2252
          • Test: nslookup("a") should fail
          • It does not fail when a host named "a" actually exists
      • INFRA-11811
          • The JDK was not set up properly on a Jenkins slave
          • Such a test can fail only when the job is assigned to a specific buildbot, so it looks like a flaky test
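A test that expects a name lookup to fail is less dependent on the host's network if it uses a name that is reserved never to resolve. The sketch below uses the `.invalid` top-level domain, which RFC 2606 reserves for this purpose. It is an illustration only and is not claimed to be the fix adopted in ZOOKEEPER-2252; `UnresolvableHostCheck`, `Resolver`, and `resolves` are hypothetical names, and the resolver is injectable so the logic can be exercised without network access.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class UnresolvableHostCheck {
    /** Minimal resolver abstraction (hypothetical) so tests can substitute a fake. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /** Returns true if the resolver finds an address for the host, false otherwise. */
    static boolean resolves(String host, Resolver resolver) {
        try {
            resolver.resolve(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A short name such as "a" may exist on some networks;
        // a name under the reserved ".invalid" TLD should not.
        String host = "no-such-host.invalid";
        boolean found = resolves(host, InetAddress::getByName);
        System.out.println(host + " resolves: " + found);
    }
}
```

A misbehaving resolver (for example, one that rewrites failed lookups) can still return an address for such a name, so this reduces the dependency on host configuration rather than eliminating it.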
  • 16. CI host performance: they're not made equal
      • Hadoop's buildbot: https://builds.apache.org/computer/
      Data captured on 25/04/2016.
  • 17. CI host performance: they're not made equal
      • Spark's buildbot: https://amplab.cs.berkeley.edu/jenkins/computer/
  • 18. CI host performance: they're not made equal
      • Significant difference in the response time!
      • This may be related to the fact that Spark has only a small number of test-related issues (e.g. YARN 63% vs Spark 6%, slide 7)
      Target    Average    Max       Min
      Hadoop    1163ms     1482ms    30ms
      Spark        3ms        6ms     0ms
  • 19. Docker issues
      Docker is great for testing!
      • Some Apache projects use Docker on their CI (via Apache Yetus)
      • Apache BigTop also uses Docker for provisioning Hadoop
      • People also love Docker for setting up testbeds on their workstations and laptops
      • Me too, of course
  • 20. Docker #18180: Java VM unkillable zombie
      • Mentioned in several Apache-related issue tickets:
          • jupyter/docker-stacks#75: Spark hanging
          • docker-library/cassandra#43, #46
          • docker-solr/docker-solr#4
          • ALLURA-8039
          • AMBARI-14706
          • IGNITE-2377
          • YETUS-229
          • …
      • Fortunately, the Apache buildbot (Yetus) did not hit the bug, but the bug made people's local testbeds flaky in a weird way
      • Fixed in recent kernels (so, strictly speaking, it is not a Docker issue)
  • 21. Other potential Docker-related issues
      • AUFS: fcntl(F_SETFL, O_APPEND) was not supported (#20199)
          • Can cause data corruption (Dovecot is known to be affected)
          • Fixed in recent AUFS
      • Overlay: you should not open O_RDWR and O_RDONLY simultaneously (#10180)
          • Can cause data corruption (RPM is known to be affected)
          • Expected behavior, won't get fixed
      • More information: https://github.com/AkihiroSuda/docker-issues
  • 22. Flaky tests are not limited to xUnit in CI
      • Some issues can occur only in a deployed environment rather than in CI
          • e.g. TCP packet corruption
          • Very flaky and critical
  • 23. TCP packet corruption
      • https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/
          • The TCP checksum was ignored in some IPsec configurations
          • ZooKeeper intermittently misbehaved due to corrupted TCP packets
      • https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.gq8chzply
          • The TCP checksum was ignored in some veth configurations
          • Mesos and Kubernetes are affected
  • 24. TCP packet corruption
      • It's very hard to notice (and reproduce) flaky TCP packet corruption...
      • Should distributed systems be TCP-corruption tolerant...?
          • The probability is very low in regular environments, but it is not zero (32-bit Ethernet CRC + 16-bit TCP checksum)
      • JIRA issues: ZOOKEEPER-2175, HDFS-8161…
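To make the "not zero" point concrete: the 16-bit Internet checksum is a ones' complement sum of 16-bit words, so some corruptions (for example, two words swapped) leave it unchanged. The following is a minimal sketch; the class name InternetChecksum is hypothetical, and it checksums only a raw byte array, without the TCP pseudo-header that a real TCP stack includes.

```java
final class InternetChecksum {
    // Ones' complement sum of big-endian 16-bit words; an odd trailing byte is zero-padded.
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int hi = data[i] & 0xFF;
            int lo = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0;
            sum += (hi << 8) | lo;
        }
        // Fold the carries back into the low 16 bits.
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);
        }
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        byte[] original = {0x12, 0x34, 0x56, 0x78};
        byte[] swapped = {0x56, 0x78, 0x12, 0x34}; // the two 16-bit words are reordered
        // Both payloads yield the same checksum, so this corruption would go unnoticed.
        System.out.println(checksum(original) == checksum(swapped)); // prints "true"
    }
}
```

The same holds for compensating changes (one word incremented, another decremented by the same amount), which is why an ignored or defeated lower-layer check can let corrupted data reach the application.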
  • 25. Agenda
      • Current "flakiness" in Apache software
      • Why flaky test matters?
      • What causes a flaky test?
      • How can we find, reproduce, and fix a flaky test?
          • Existing work at Apache communities
          • Our work: Namazu (鯰, catfish) https://github.com/osrg/namazu
  • 26. Efforts to find/reproduce a flaky test
      • determine-flaky-tests-hadoop.py
      • Apache Kudu's CI (dist_test)
      • Google's TAP
      • Our work: Namazu https://github.com/osrg/namazu
      • and similar great tools
  • 27. determine-flaky-tests-hadoop.py
      • Picks up failed tests using the Jenkins API
      • Included in hadoop.git/dev-support (HADOOP-11045)
          $ determine-flaky-tests-hadoop.py --job Hadoop-YARN-trunk
          ****Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-YARN-trunk
          ...
          Among 15 runs examined, all failed tests <#failedRuns: testName>:
          7: TestContainerManagerRecovery.testApplicationRecovery
          ...
  • 28. determine-flaky-tests-hadoop.py
      • Great tool, but it doesn't support running a specific test repeatedly
      • There is also a Maven dependency issue (YARN-4478)
          • B depends on A
          • TestB is never executed if TestA fails → if TestA is flaky, we can't evaluate the flakiness of TestB!
  • 29. Kudu's CI: flaky test dashboard
      • http://dist-test.cloudera.org:8080/ (Apr 25)
      • Recently open-sourced and introduced at Apache: Big Data (Monday)
      • https://github.com/cloudera/dist_test
  • 30. Kudu's CI: flaky test dashboard
      • Tests are run repeatedly on CI to find flaky tests
          • KUDU_FLAKY_TEST_ATTEMPTS
          • KUDU_FLAKY_TEST_LIST
      • Example commit (from https://github.com/apache/incubator-kudu/commit/1a24338a):
          Fix flakiness of client_failover-itest
          The reason this test was flaky is that there is a race between.. ..
          Looped 100x and they all passed: http://dist-test.cloudera.org/job?job_id=mpercy.1454486819.10566
          Author: Mike Percy, Jan 29, 2016 8:01 AM
          Committer: Todd Lipcon, Feb 4, 2016 2:14 PM
          Commit: 1a24338ad60a8842d1ae5e227f8f03e58faea8c0
  • 31. Google's TAP
      • Google's internal CI
      • 1.6M test failures per day
          • 73K (4.5%) are flaky
      • Repeats a failing test 10 times to label flaky tests
      • Information source: An Empirical Analysis of Flaky Tests (Q. Luo et al., ACM FSE'14)
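The rerun-and-label idea can be sketched in a few lines of Java. This is only an illustration of the general technique, not how TAP or Kudu's CI is implemented; FlakyLabeler and its Label enum are hypothetical names, and a test is modeled as a BooleanSupplier that returns true when it passes.

```java
import java.util.function.BooleanSupplier;

final class FlakyLabeler {
    enum Label { PASSING, FAILING, FLAKY }

    // Runs the test `attempts` times; a mix of passes and failures is labeled FLAKY.
    static Label label(BooleanSupplier test, int attempts) {
        int passed = 0;
        for (int i = 0; i < attempts; i++) {
            if (test.getAsBoolean()) {
                passed++;
            }
        }
        if (passed == attempts) return Label.PASSING;
        if (passed == 0) return Label.FAILING;
        return Label.FLAKY;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical test that fails on every third run.
        Label result = label(() -> ++calls[0] % 3 != 0, 10);
        System.out.println(result); // prints "FLAKY"
    }
}
```

Plain repetition like this only samples whatever scheduling the host happens to produce, which is the limitation the next slide raises.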
  • 32. Challenge: poor non-determinism
      • Modern CIs run jobs repeatedly to find / reproduce flaky tests
      • But they don't control non-determinism
          • → They may overlook a flaky test
          • → They cannot reproduce a failure → cannot analyze the failure
      • Our suggestion: increase non-determinism for finding and reproducing flaky tests
  • 33. NAMAZU: PROGRAMMABLE FUZZY SCHEDULER
      • https://github.com/osrg/namazu
      • NOTE: Namazu was formerly named "Earthquake"
  • 34. Namazu: programmable fuzzy scheduler
      • https://github.com/osrg/namazu
      • Increases non-determinism for finding and reproducing flaky tests
      • Diagram: events (filesystem, packet, Java, Linux threads, Go [planned]) → fuzzed (randomized) schedule
      • 鯰 (namazu) means catfish in Japanese
  • 35. Namazu: programmable fuzzy scheduler
      • https://github.com/osrg/namazu
      • Namazu uses non-invasive techniques
          • can be easily applied to any environment
          • can avoid false positives
          Target          Technique
          Filesystem      FUSE
          Packet          Netfilter, OpenFlow
          Java            Byteman, AspectJ
          Linux threads   sched_setattr(2)
          Go [planned]    AspectGo [wip] (https://github.com/AkihiroSuda/golang-exp-aspectgo)
  • 36. Namazu targets
      • xUnit tests
          • 😃 Easy to get started; just run `mvn`
          • 😃 Can reproduce test failures observed in CI
          • 😞 Limited testable scope
      • Integration tests on a distributed cluster
          • 😃 Can test everything
          • 😞 Need to write a script to set up the cluster
              • But Docker helps us a lot!
  • 37. Namazu targets
      • We support both scenarios
          • Single-node mode (for xUnit tests): $ mvn test
          • Distributed mode (for integration tests): nodes talk to the orchestrator via RPC
  • 38. NAMAZU + XUNIT TESTS
      $ mvn test
  • 39. Namazu + xUnit tests
      • Namazu is a comprehensive framework...
      • Quick start: "renice" threads for xUnit tests
          • POSIX.1 requires that all threads in a process share a single nice (priority) value, but the actual Linux implementation (NPTL) does not: the nice value is per-thread
          • Not always effective, but it's generic and easy to get started
  • 40. Namazu + xUnit tests
          $ PID=$(docker inspect $(docker ps -q -f ancestor=hadoop-build-ubuntu) | jq .[0].State.Pid)
          $ sudo nmz inspectors proc -pid $PID
          $ cd hadoop; ./start-build-env.sh
          [container]$ mvn test -Dtest=TestFoo#testBar
      • Namazu periodically sets random nice values for all the child processes and threads under $PID
      • It also uses non-default kernel scheduling policies (e.g. SCHED_BATCH)
  • 41. Namazu + xUnit tests: Reproducibility
          Issue            Testcase                       Traditional   Namazu
          YARN-4548        RM/TestCapacityScheduler       11%           82%
          YARN-4556        RM/TestFifoScheduler           2%            44%
          ZOOKEEPER-2137   ReconfigTest                   2%            16%
          YARN-4168        NM/TestLogAggregationService   1%            8%
          YARN-1978        NM/TestLogAggregationService   0%            4%
          YARN-4543        NM/TestNodeStatusUpdater       0%            1%
      • More information: osrg/namazu#125
  • 42. Namazu + xUnit tests: Reproducibility
          Issue            Testcase               Traditional   Namazu
          ZOOKEEPER-2080   ReconfigRecoveryTest   14.0%         61.9%
      • "Renicing" is not always effective...
      • But even when renicing is ineffective, you can sometimes reproduce the flaky test by injecting delays or reordering packets
          $ sudo iptables ... -j NFQUEUE --queue-num 42
          $ sudo nmz inspectors ethernet -nfq-number 42
  • 43. NAMAZU + INTEGRATION TESTS
  • 44. Namazu + Integration tests
      • ZooKeeper: distributed coordination service
          • used in Hadoop, Spark, Mesos, Kafka..
      • ZooKeeper 3.5 (alpha) introduced dynamic reconfiguration
      • We performed an integration test to evaluate the reliability of the reconfiguration
      • We found a flaky bug!
  • 45. Namazu + Integration tests
      • We permuted some specific Ethernet packets in random order using Namazu
      • TCP retransmissions are eliminated to reduce the possible state space
      • Setup: ZooKeeper cluster, connected through Open vSwitch + Ryu SDN Framework + Namazu
  • 46. Found ZOOKEEPER-2212
      • Bug: a new node cannot participate in the ZK cluster properly
          • The new node cannot become a leader of the ZK cluster (more technically, it remains an "observer")
      • Cause: distributed race (ZAB packet vs FLE packet)
          • ZAB: atomic broadcast protocol for data
          • FLE: leader election protocol for the ZK cluster itself
      • Diagram: the leader of the ZK cluster and the new ZK node exchange ZAB [2888/tcp] and FLE [3888/tcp] packets over different TCP connections, so the packet order is non-deterministic
  • 47. Found ZOOKEEPER-2212
      • Data captured on 22/01/2016
  • 48. Found ZOOKEEPER-2212
      • Expected: the ZK cluster keeps working as long as fewer than N/2 nodes have crashed
      • Actual: a single node failure can terminate the 3-node ensemble
      • Diagram label: the new node is not participating properly (it remains an "observer")
  • 49. How hard is it to reproduce?
      • Reproducibility: 0.0% → 21.8% (tested 1,000 times)
      • We could not reproduce the bug even after 5,000 runs of traditional testing (60 hours!)
      • It can also be reproduced by "renicing" threads, but the reproducibility is only 0.7%
  • 50. Why can we hit the bug?
      • We define the distributed execution pattern based on code coverage:
          $$P = \begin{pmatrix} p_{1,1} & \cdots & p_{1,N} \\ \vdots & \ddots & \vdots \\ p_{L,1} & \cdots & p_{L,N} \end{pmatrix}$$
          • $L$: LOC
          • $N$: number of nodes (3 in this case)
          • $p_{i,j}$: 1 if node $j$ covers the branch in line $i$, otherwise 0
      • We used JaCoCo, the Java Code Coverage Library (patch: ZOOKEEPER-2266)
      • Namazu achieves faster pattern growth. That's why we can hit the bug.
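"Pattern growth" can be pictured as the number of distinct coverage matrices P observed as more runs are executed. The following is a minimal sketch, assuming each run has already been reduced to an L×N boolean matrix; PatternCounter is a hypothetical helper and is not part of Namazu or JaCoCo.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class PatternCounter {
    // Textual fingerprints of every coverage matrix seen so far.
    private final Set<String> seen = new HashSet<>();

    // Records one run's matrix (rows = lines, columns = nodes); returns true if the pattern is new.
    boolean record(boolean[][] p) {
        return seen.add(Arrays.deepToString(p));
    }

    // Number of distinct execution patterns observed so far.
    int distinct() {
        return seen.size();
    }

    public static void main(String[] args) {
        PatternCounter counter = new PatternCounter();
        boolean[][] run1 = {{true, false, true}, {false, false, true}};
        boolean[][] run2 = {{true, true, true}, {false, false, true}};
        counter.record(run1);
        counter.record(run1); // same pattern again: no growth
        counter.record(run2);
        System.out.println(counter.distinct()); // prints "2"
    }
}
```

Plotting distinct() against the number of runs, once with and once without fuzzed scheduling, would give the kind of growth curves the slide compares.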
  • 51. HOW TO USE NAMAZU?
  • 52. How to use Namazu?
      • Easy to install
          $ sudo apt-get install lib{netfilter-queue,zmq3}-dev
          $ go get github.com/osrg/namazu/nmz
      • Easy to get started
          • Provides a Docker-like CLI
          • No code instrumentation needed
          • No configuration needed (default: just renice threads)
          $ sudo nmz container run -it -v /foo:/foo ubuntu
          [container]$ cd /foo && mvn test
  • 53. How to use Namazu?
      • For threads ("renicing")
          $ sudo nmz inspectors proc -pid $TARGET_PID
      • For filesystem
          $ sudo nmz inspectors fs -mount-point /nmzfs
      • For network packets
          $ sudo iptables ... -j NFQUEUE --queue-num 42
          $ sudo nmz inspectors ethernet -nfq-number 42
      • Need distributed mode? (for integration testing) Just add `--orchestrator-url http://foobar:10080/api/v3` to the CLI.
  • 54. Namazu API (Go)
          type ExplorePolicy interface {
              QueueEvent(Event)
              ActionChan() chan Action
          }

          func (p *MyPolicy) QueueEvent(event Event) {
              action := event.DefaultAction()
              p.timeBoundedQ.Enqueue(action, 10 * Millisecond, 30 * Millisecond)
          }

          func (p *MyPolicy) ActionChan() chan Action {
              return p.timeBoundedQ.DequeueChan
          }
      • In QueueEvent, the action is randomly fired in [10ms, 30ms]; you can also inject fault actions there
      • An event can contain Ethernet packet bytes
      • Namazu defines a REST API, so you can also use other languages
  • 55. API use case: found YARN-4301
      • We found a bug: YARN cannot detect disk failure cases where mkdir()/rmdir() blocks
          • Case 1: mkdir() returns EIO explicitly
          • Case 2: mkdir() blocks
      • While reading the code, we noticed that the bug could theoretically occur, and we actually reproduced it using Namazu
      • The point at which the fault should be injected is known in advance, so we manually wrote a concrete scenario using the Namazu API
      • Much more realistic than JUnit + mocking
  • 56. API use case: found YARN-4301
          func (p *MyPolicy) signalHandler() {
              signal.Notify(sigChan, syscall.SIGUSR1)
              for {
                  <-sigChan
                  p.sleep = 10 * time.Minute  // fault: blocks for 10 minutes
              }
          }
          go p.signalHandler()
          func (p *MyPolicy) QueueEvent(event Event) {..}
          func (p *MyPolicy) ActionChan() chan Action {..}

          $ go run mypolicy.go inspectors fs -mount-point /nmzfs
      • Set "yarn.nodemanager.local-dirs" to "/nmzfs/nm-local-dir", then send SIGUSR1 to Namazu when you (and YARN) are ready
      • An interactive test is often easier than writing a JUnit testcase
      • We use SIGUSR1 here, but it would also be interesting to implement a human-friendly CLI or GUI for interactive testing
  • 57. API use case: found YARN-4301
  • 58. Another API use case: "semi"-deterministic replay
      • If you know the protocol, you can compute a hash for each packet
          • Note that you have to eliminate time-dependent and random bytes when you hash the packet
      • Using the hash and the Namazu API, you can "semi"-deterministically replay the scenario
          • Not fully deterministic; it is only best effort
          • Record-less! You just need to remember the "seed" for replaying
      • PoC: ZOOKEEPER-2212: up to 65% reproducibility
      • More information: osrg/namazu#137
      • See also (for Go): https://github.com/AkihiroSuda/go-replay
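The seed-plus-hash idea can be illustrated as follows. This is a minimal sketch, not the Namazu implementation: ReplayDelay, its method names, and the byte range treated as "volatile" are all assumptions made for the example. Each packet's delay is derived only from the seed and a hash of the packet's stable bytes, so the same seed tends to produce the same schedule without recording anything.

```java
import java.util.Arrays;
import java.util.Random;

final class ReplayDelay {
    // Hashes the packet after zeroing bytes [volatileFrom, volatileTo),
    // a hypothetical range holding timestamps or random IDs.
    static int stableHash(byte[] packet, int volatileFrom, int volatileTo) {
        byte[] copy = packet.clone(); // leave the caller's packet untouched
        Arrays.fill(copy, volatileFrom, volatileTo, (byte) 0);
        return Arrays.hashCode(copy);
    }

    // Derives a delay in [minMs, maxMs] from the seed and the packet hash alone.
    static long delayMillis(long seed, int packetHash, long minMs, long maxMs) {
        Random random = new Random(seed ^ packetHash);
        return minMs + (long) (random.nextDouble() * (maxMs - minMs + 1));
    }

    public static void main(String[] args) {
        byte[] first = {1, 2, 3, 4, 5, 6};
        byte[] second = {1, 2, 9, 9, 5, 6}; // differs only in the volatile bytes 2..3
        long d1 = delayMillis(42L, stableHash(first, 2, 4), 10, 30);
        long d2 = delayMillis(42L, stableHash(second, 2, 4), 10, 30);
        System.out.println(d1 == d2); // prints "true"
    }
}
```

Replay stays best effort because packets whose stable bytes are identical get identical delays, and anything outside the hashed events (thread scheduling, disk latency) is still uncontrolled.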
  • 59. SIMILAR GREAT TOOLS
  • 60. Similar great tool: Jepsen
      • Network partitioner + linearizability tester
      • Famous for the "Call Me Maybe" blog: http://jepsen.io/
          • "Call Me Maybe" by Carly Rae Jepsen (vevo): https://www.youtube.com/watch?v=fWNaR-rxAic
      • Randomly injects network partitions using iptables
      • "Linearizability" ∈ "Strong consistency"
      • Integration test on a flaky network rather than a flaky xUnit test
  • 61. Similar great tool: Jepsen
      • Has been used to test several Apache projects
          • Cassandra: 9851, 10001, 10068, 10231, 10413, 10674
              • http://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
          • HBase
          • Kafka
          • Solr: 6530, 6583, 6610
              • http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks
          • ZooKeeper
  • 62. Namazu + Jepsen?
      • Namazu is much more generalized
          • The bugs we found/reproduced are basically beyond the scope of Jepsen (threads, disks..)
      • Namazu can also be combined with Jepsen! That will be our next work..
          • Jepsen: causes network partitions, tests linearizability
          • Namazu: increases non-determinism, injects filesystem faults, ...
  • 63. Similar great tool: CharybdeFS
      • Makes the filesystem flaky using FUSE
      • Used in testing ScyllaDB (a clone of Apache Cassandra)
      • https://github.com/scylladb/charybdefs
      • Similar to Namazu FS
          • Both support an API
          • Also similar to PetardFS (not active since 2007)
      • CharybdeFS can also be combined with Namazu
          • CharybdeFS is specialized in FS; Namazu is much more comprehensive
  • 64. Similar great tool: DEMi (appeared in NSDI'16)
      • https://github.com/NetSys/demi
      • Found some akka-raft bugs and reproduced a few Spark bugs
          • Challenge: reducing false positives related to instrumentation
      • DEMi and Namazu are complementary to each other
          • DEMi is powerful, but has some limitations
          • Namazu is comprehensive and easy to get started with
                                    Namazu                                      DEMi
          Target                    Generic (Network, Filesystem, Thread..)    Akka
          Getting started           Easy                                        Need to write AspectJ code
          Deterministic replay?     No                                          Yes
          Bug cause minimization?   No                                          Yes
  • 65. SO... HOW CAN WE FIX FLAKY TESTS?
  • 66. How can we fix flaky tests?
      • Namazu finds/reproduces flaky tests, but it doesn't automatically fix them 😞
      • Basic approach for async-related flakiness: adjust the values for sleep() and retries in the test code
          invokeAsyncOperation();
          // some tests lack even this sleep
          sleep(certainHardcodedTimeout);
          assertTrue(checkSomethingGoodHasHappened());
  • 67. How can we fix flaky tests?
          invokeAsyncOperation();
          // some tests lack even this sleep
          sleep(certainHardcodedTimeout);
          assertTrue(checkSomethingGoodHasHappened());
      • Suggestion: the timeout (and the number of retries) should be a configurable parameter rather than a hard-coded value
          Timeout value   Cost (time)   Risk (timeout)   Appropriate for
          Long            High          Low              Slow machine (e.g. CI); conservative person
          Short           Low           High             Fast machine; risk-tolerant person
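One way to act on this suggestion is to poll for the condition with a deadline taken from configuration instead of sleeping for a fixed time. The following is a minimal sketch under stated assumptions: WaitFor and the system property name test.wait.timeout.ms are hypothetical, and the lambda in main stands in for checkSomethingGoodHasHappened().

```java
import java.util.function.BooleanSupplier;

final class WaitFor {
    // Timeout from a system property (hypothetical name), defaulting to 10 seconds.
    static long timeoutMillis() {
        return Long.getLong("test.wait.timeout.ms", 10_000L);
    }

    // Polls `check` every intervalMs until it returns true or timeoutMs elapses.
    static boolean waitFor(BooleanSupplier check, long intervalMs, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Stand-in for an async operation that completes after about 50 ms.
        boolean ok = waitFor(() -> System.nanoTime() - start > 50_000_000L, 10, timeoutMillis());
        System.out.println(ok ? "condition met" : "timed out");
    }
}
```

With this shape, a slow CI host can run with something like -Dtest.wait.timeout.ms=60000 while a fast laptop keeps a short value, and a test that succeeds early returns as soon as the condition holds, so a long timeout only costs time on failure.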
  • 68. CONCLUSION
  • 69. Conclusion
      • Apache software is well tested
          • But the tests are flaky
      • Let's improve them
          • Improve asynchronous code
          • Repeat tests
      • Our tool can control non-determinism so as to reproduce flaky tests
          • https://github.com/osrg/namazu