This talk explores the evolution of Mozilla's continuous integration infrastructure for Firefox for Android: from our early device lab, to running tests on reference cards in custom racks, to our current implementation running on emulators in AWS. In addition, I'll discuss how we reduced the cost of running our tests in AWS by using spot instances and fine-tuning the selection of instance types. Finally, I'll discuss how we analyzed regression data to prune the number of tests we run, extending the capacity of our test pools and reducing costs. To give you some scope, our continuous integration farm consists of 6700 machines running 150,000 combined daily build and test jobs, triggered by an average of 300 pushes. This talk was given at the USENIX release engineering summit in Washington, DC on November 13, 2015.
Scaling mobile testing on AWS: Emulators all the way down
1. SCALING MOBILE TESTING ON AWS: EMULATORS ALL THE WAY DOWN
Kim Moir, Mozilla, @kmoir
URES, November 13, 2015
Good morning. My name is Kim Moir and I'm a release engineer at Mozilla. Today I'm going to discuss how we scale our Android testing on AWS. Show of hands -
how many of you test on Android? On a continuous integration farm?
References
Androids by etnyk Attribution-NonCommercial-NoDerivs 2.0 Generic license
https://www.flickr.com/photos/etnyk/5588953445/sizes/l
2. A little about me. I live in Ottawa, Ontario, Canada. My hobbies include running and making ice cream, which complement each other well. This picture shows a release engineering ice cream flavour - coffee ice cream with chocolate chip cookies soaked in Kahlua. Before I was a release engineer at Mozilla I worked at IBM as a release engineer on Eclipse. So that's 12 years working on open source release engineering. I'm really excited to be here today to share my stories, and learn from all of you.
3. Here's a picture of where the amazing Mozilla release engineering team works. As you can see, we are quite distributed across the world, and many of us work remotely from our homes.
4. Mozilla is a non-profit. Our mission is to promote openness, innovation & opportunity on the web.
You're probably familiar with the products we build, such as Firefox for Desktop, Android, iOS and Firefox OS. Firefox for iOS was actually released yesterday - so go and try it out!
Note that we ship Firefox on four platforms, with ~97 locales, on the same day as US English.
5. We have a continuous integration farm running 24x7, triggered on commit. Our release cadence for Firefox for Android is every six weeks, with betas every week.
https://wiki.mozilla.org/RapidRelease
I'll talk a little bit about our environment in general before I delve into our Android test environment.
6. DAILY
• 350 pushes
• 4700 build jobs
• 150,000 test jobs
Here are some recent numbers on the aggregate jobs we run (all products, not just Firefox for Android). Today, about 66% of build jobs and 80% of test jobs run on AWS. Only our performance tests still run on real devices; they can't run on emulators because emulator performance is not consistent.
Each time a developer lands a change, it invokes a series of builds and associated tests on relevant platforms. Within each test job there are many actual test suites that
run.
September:
8188 pushes
https://secure.pub.build.mozilla.org/buildapi/reports/pushes?starttime=1441090800&endtime=1443682800
September jobs
https://secure.pub.build.mozilla.org/buildapi/reports/waittimes?starttime=1441090800&endtime=1443682800
Builds Oct 4-Oct10
https://secure.pub.build.mozilla.org/buildapi/reports/waittimes?starttime=1443942000&endtime=1444460400
builds 15560
Builds Tuesday Oct 6
https://secure.pub.build.mozilla.org/buildapi/reports/waittimes?starttime=1444104000&endtime=1444190400
2814
7. 15 MINUTE SERVICE
We have a commitment to developers that build/test jobs should start within 15 minutes of being requested. We don't have a perfect record on this, but our numbers are certainly good. We have metrics that measure this every day so we can see which platforms need additional capacity. We adjust capacity as needed, and remove old platforms as they become less relevant in the marketplace.
---
Pizza picture by djwtwo
Attribution-NonCommercial-ShareAlike 2.0 Generic (CC BY-NC-SA 2.0)
https://www.flickr.com/photos/djwtwo/9864611814/sizes/l/
8. + many Mozilla tools
Here are some of the projects that we use in our infrastructure.
Buildbot is our continuous integration engine. However, we are in the process of migrating to TaskCluster. TaskCluster is a set of components that manages task queuing, scheduling, execution and provisioning of resources. It was designed to run automated builds and tests at Mozilla.
We use Puppet for configuration management of all our Buildbot servers and the Linux and Mac machines. So when we provision new hardware, we just boot the device and it puppetizes based on its role, which is defined by its hostname.
Our repository of record is hg.mozilla.org, but developers also commit to git repos and these commits are transferred to the hg repository. We also use a lot of Mozilla tools that allow us to scale. These tools are open source as well, and I have links to these repos at the end of the talk.
---
References
octokitty http://www.flickr.com/photos/tachikoma/2760470578/sizes/l/
9. DEVICES
• 6700+ in total
• 1900+ for builds
• 4700+ for tests
• 75% AWS
These numbers are for both Android and desktop devices. The pools overlap.
80% test AWS and 66% build AWS
---
References
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html
* https://secure.pub.build.mozilla.org/slavealloc/ui/#silos
10. HISTORY OF MOBILE TESTING AT MOZILLA
Before I talk about where we are today, I'd like to step back and talk about how our mobile testing evolved over the years.
Here's a picture from 2009 of a mobile pedalboard. This was our first attempt at mobile test automation. It was used to report Fennec performance data on the Nokia N810s.
Picture by Aki Sasaki
https://www.flickr.com/photos/drkscrtlv/3590117065/sizes/l
11. Picture by Aki Sasaki
https://www.flickr.com/photos/drkscrtlv/3590924524/sizes/l
http://escapewindow.dreamwidth.org/205930.html
12. In 2010, we moved on to testing Android 2.2 on Tegras. Tegras are bare reference boards.
We stored the Tegras in shoe racks from Bed Bath and Beyond.
The shoe racks were stored in a room that was shielded from wireless interference. They allowed us to position the devices so they weren't too close together, on a material that didn't get too hot and did not conduct electricity. These racks also made it easy to take dead devices out, open them, remove batteries, reimage and replace them.
Picture from John O'Duinn's blog
http://oduinn.com/blog/2010/02/11/unveiling-mozillas-faraday-cage/
http://oduinn.com/images/2013/blog_2013_RelEngAsForceMultiplier.pdf
13. In 2012, we started running continuous integration tests on Android reference cards in specially designed racks. We started with 800 of them, but only use about 200 today. The cards are called pandas. They were used to run Android 4.0 tests for correctness, debug and performance.
___
References
Pictures of Panda chassis from Dustin's blog
https://blog.mozilla.org/it/2013/01/04/mozpool/2012-11-09-08-30-03/
14. They had a custom relay board to allow us to reboot them remotely.
Pictures of Panda chassis from Dustin's blog
https://blog.mozilla.org/it/2013/01/04/mozpool/2012-11-09-08-30-03/
15. Many racks of pandas
These devices are not as stable as desktop machines, and are prone to failure. Given their numbers, dealing with constant machine failures would be very expensive if they were managed by humans, so we wrote software called mozpool to automatically reimage and reboot them.
Pictures of Panda chassis from Dustin's blog
https://blog.mozilla.org/it/2013/01/04/mozpool/2012-11-09-08-30-03/
16. WHAT DID WE LEARN?
What did we learn over these iterations of our mobile testing infrastructure?
Each successive mobile testing solution became more reliable (fewer infra failures) and easier to manage via automated tools.
Manufacturers EOL reference cards, and old reference cards don't support new Android versions.
Physical hardware does not scale for peak load.
It is time consuming and expensive to adjust automation infrastructure for every new hardware iteration.
Picture
https://www.flickr.com/photos/wocintechchat/21909333504/sizes/l from
http://www.wocintechchat.com/blog/wocintechphotos #WOCtechchat
Picture: computer history museum
https://www.flickr.com/photos/indigoprime/2239342335/sizes/o/
17. We have bursty traffic, varying both by time of day and time of year.
Here's an example of the number of jobs running per hour in a typical week.
Bursty traffic - you can see that the number of jobs run each day varies as time zones wake up, and the large trough is the weekend.
18. BRANCHING
We have many different branches in Hg at Mozilla. Our Hg branches are all named after different tree species.
Developers push to different branches depending on their purpose, and different branches have different scheduling priorities within our continuous integration engine. For instance, if a change lands on the mozilla-beta branch, the builds and tests associated with that change will have machines allocated to them at a higher priority than if the change landed on the cedar branch, which is just for testing purposes.
Picture by Aurelio Asiain
Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://flic.kr/p/v27AD
20. And here is current Android adoption (October 2015).
Android "Kit Kat" 4.4 has about a 40% adoption rate.
Android "Jelly Bean" versions (4.1-4.3.1) have a combined share of 30.2%.
Sources
https://en.wikipedia.org/wiki/Android_version_history
21. ANDROID TEST PLATFORMS
• Android 2.3, 4.0, 4.2 (x86), 4.3
• Test types:
• correctness
• debug
• performance
Obviously, we cannot test on all those platforms and devices; it's not feasible. We limit our testing to the following platforms.
22. In 2012, we started moving our build and test infrastructure to Amazon. We first implemented this for desktop Firefox jobs on Linux, and then for Android.
Scalable infrastructure for bursty traffic with an API to manage it all.
Scalable
Deals with bursty load
APIs!
Picture by Tim Norris
Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)
https://www.flickr.com/photos/tim_norris/2600844073/sizes/o/
23. AWS TERMINOLOGY
• EC2 - Elastic Compute Cloud - machines as VMs
• EBS - Elastic Block Store - network attached storage
• Region - a separate geographical area
• Availability zone - multiple, isolated locations within a region
I'm going to talk a bit about some AWS terms for those of you who may not be familiar with them.
Notes:
AWS instance types http://aws.amazon.com/ec2/instance-types/
24. MORE AWS TERMS
• AMI - Amazon Machine Image
• Instance type - a VM with defined specifications and cost per hour. For example:
- AMIs - Amazon has standard ones that you can modify, or you can create your own
- Pricing on instance types can depend on the region
- An m3.medium currently costs around $0.07/hr in most regions (Nov 2015 costs)
- Some instance types may not be available in all availability zones
25. PUPPET VS AMIS
AMIs are Amazon Machine Images.
Golden AMIs
We create golden image AMIs via cron each night. These images are generated from our puppet configs. We have different images defined for each instance type and the role it performs; for example, test and build instances have different libraries and configuration in puppet.
Originally we used puppet to manage all of our build and test instances, but it was too slow to puppetize the spot instances.
Solution: create golden AMIs from the configs each night via cron. These are used to instantiate the new spot instances.
We also use the same pool AMI to run Android tests and Linux tests; they just run in different directories. Another reason for nightly regeneration is pre-populating VCS caches to reduce first-time startup load.
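As a sketch of how a provisioner might pick the most recent nightly golden image when instantiating spot instances - the naming convention and metadata shape here are hypothetical, not the actual build-cloud-tools code:

```python
from datetime import datetime

def latest_golden_ami(amis, role):
    """Pick the most recently created golden AMI for a role.

    `amis` is a list of dicts with 'name' and 'creation_date' keys,
    shaped loosely like the image metadata boto returns. Golden images
    are assumed to be named '<role>-golden-YYYY-MM-DD' by the nightly
    cron job (a hypothetical naming convention for illustration).
    """
    candidates = [a for a in amis if a["name"].startswith(role + "-golden-")]
    if not candidates:
        return None
    # Newest creation date wins, so instances always boot last night's image.
    return max(candidates,
               key=lambda a: datetime.strptime(a["creation_date"], "%Y-%m-%d"))
```

The point of the nightly regeneration is that the expensive puppetization happens once per day in the image build, not on every spot instance boot.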
Picture by shaireproductions
Creative Commons Attribution 2.0 Generic (CC BY 2.0)
https://flic.kr/p/dTfsCs
26. USE SPOT INSTANCES
• Use spot instances vs on-demand instances
• much cheaper
• not instantiated as quickly
• terminated if outbid while running
Amazon has many different types of instances. Initially, we used on-demand instances. They instantiate quickly but cost more per hour than other options.
Spot instances are Amazon's way of auctioning off excess capacity. You bid for an instance, and if nobody else bids above your offer, the spot instance is instantiated for you. However, if you're running a spot instance and someone bids a higher price than you did, your instance can be killed. That's okay, because we have configured our build farm to retry jobs that fail, and only a very small percentage of jobs (< 1%) are killed this way.
Since spot instances aren't available as quickly as on-demand instances, some tests don't start within 15 minutes, but that's acceptable. Spot instances are instantiated every time with the AMI you specify.
Other notes
Smart spot bidding library https://bugzilla.mozilla.org/show_bug.cgi?id=972562
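The core bidding idea can be sketched as a small helper: cap the bid below the on-demand price so a price spike never costs more than on-demand would, and pick the cheapest availability zone currently under the cap. The ratio, field names and prices are illustrative, not Mozilla's actual bidding logic:

```python
def choose_spot_bid(current_prices, on_demand_price, bid_ratio=0.8):
    """Choose an availability zone and bid price for a spot request.

    `current_prices` maps AZ name -> current spot price ($/hr). The bid
    is capped at `bid_ratio` of the on-demand price. Returns (az, bid)
    for the cheapest affordable zone, or None if every zone is currently
    priced above the cap (in which case the caller might fall back to
    on-demand or simply wait).
    """
    bid = round(on_demand_price * bid_ratio, 4)
    affordable = {az: p for az, p in current_prices.items() if p < bid}
    if not affordable:
        return None
    cheapest_az = min(affordable, key=affordable.get)
    return cheapest_az, bid
```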
27. Minimum viable instance type
Run more tests in parallel on a cheaper instance type rather than upgrading the instance type.
Most tests run on m3.medium, but some need more.
Limit the subset of tests run on more expensive instance types to those that actually need it.
Our test suites have timeouts; if a suite doesn't complete within its timeout, it fails and retries.
At our scale, it's much cheaper to run more tests in parallel on a cheaper instance type than to run them on a more expensive instance type. For example, our Android 4.3 reftests invoke 48 parallel jobs.
For instance, we have Android tests that run on emulators in AWS. Some of the reftests required a c3.xlarge to run, while the correctness tests were fine on m3.medium.
Picture by kenny magic
Creative Commons Attribution 2.0 Generic (CC BY 2.0)
https://www.flickr.com/photos/kwl/4247555680/sizes/l
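Splitting a suite into many parallel jobs amounts to chunking the test list, in the spirit of the --total-chunks/--this-chunk style flags Mozilla's test harnesses expose. A minimal sketch of even chunking:

```python
def chunk_tests(tests, total_chunks, this_chunk):
    """Return the slice of `tests` that chunk `this_chunk` of
    `total_chunks` should run (chunks are 1-indexed).

    Distributes len(tests) as evenly as possible: the first
    `len(tests) % total_chunks` chunks each get one extra test.
    """
    base, extra = divmod(len(tests), total_chunks)
    start = (this_chunk - 1) * base + min(this_chunk - 1, extra)
    size = base + (1 if this_chunk <= extra else 0)
    return tests[start:start + size]
```

Each of the 48 reftest jobs would run one such chunk on its own cheap instance, so total wall-clock time stays low without paying for large instance types.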
28. WHERE'S THE CODE?
• The tools we use are all open source
• https://github.com/mozilla/build-cloud-tools
• which uses the boto libraries (a Python interface to AWS): https://github.com/boto/boto
The code we use to interact with the AWS APIs resides here.
29. SMARTER BIDDING ALGORITHMS
• Important scripts
• aws_stop_idle.py
• aws_watch_pending.py
- aws_stop_idle stops instances that are no longer needed given our current capacity (idle for a certain time period - the threshold depends on whether the instance is on-demand or spot)
- aws_watch_pending activates instances given the criteria on the next slide
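The stop-idle decision can be sketched as follows - the thresholds and instance record shape are illustrative, not the actual values or data structures in aws_stop_idle.py:

```python
# Illustrative idle thresholds in seconds, per billing kind; the slide
# only says the threshold differs between spot and on-demand.
IDLE_THRESHOLDS = {"spot": 10 * 60, "on-demand": 30 * 60}

def instances_to_stop(instances):
    """Return names of instances idle past their kind's threshold.

    `instances` is a list of dicts with 'name', 'kind' ('spot' or
    'on-demand') and 'idle_seconds'. Instances under threshold are kept
    alive so they can pick up the next job without a cold start.
    """
    return [i["name"] for i in instances
            if i["idle_seconds"] >= IDLE_THRESHOLDS[i["kind"]]]
```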
30. REGIONS AND INSTANCES
• Run instances in multiple regions
• Start instances in cheaper regions first
• Automatically shut down inactive instances
• Start instances that have been recently running
• Bid on similar instance types
If you look at aws_watch_pending.py, these are some of the rules that it implements.
We also use machines in multiple AWS regions, both in case one region goes down and for cost savings (some regions are cheaper). Currently we only use us-east-1 and us-west-2; since all of our CI infrastructure resides in California, we don't use most other regions. That is unlike some companies that need instances available instantly everywhere - for instance, I recently saw a talk by Bridget Kromhout (http://bridgetkromhout.com/speaking/2014/beyondthecode/), an operations engineer at DramaFever, a company that provides international movie content on demand. They use every single AWS region because their customer base is so distributed.
You get better build times and lower costs if you start instances that have recently been running (they still retain artifact dirs, and there are billing advantages).
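Two of the rules above - cheaper regions first, and recently-running instances first - can be sketched as a single sort over restart candidates. The record fields and cost ranking are illustrative:

```python
def order_candidates(stopped_instances, region_cost_rank):
    """Order stopped instances for restart.

    Cheaper regions come first; within a region, the most recently
    running instance comes first, since it is more likely to still have
    warm artifact directories and VCS caches. `region_cost_rank` maps
    region name -> rank (0 = cheapest); `stopped_instances` is a list of
    dicts with 'name', 'region' and 'stopped_at' (unix timestamp).
    """
    return sorted(
        stopped_instances,
        key=lambda i: (region_cost_rank[i["region"]], -i["stopped_at"]))
```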
31. LIMIT POOL SIZE
The size of the AWS pools allocated to different instance types is limited, so if the number of requests spikes we have higher pending counts, but not a huge spike in our AWS bill.
The bidding algorithm does not automatically bring up machines for all pending jobs. It adds some more capacity, waits, re-evaluates the pending count, and adds some more if needed - similar to a thermostat system heating your house, gradually adding more heat.
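The thermostat behaviour reduces to a small per-cycle calculation; the step size here is an illustrative number, not Mozilla's actual value:

```python
def capacity_to_add(pending_jobs, running, pool_limit, step=10):
    """Thermostat-style scaling for one evaluation cycle.

    Adds at most `step` instances per cycle, never exceeds the pool
    limit, and never starts more instances than there are pending jobs.
    The caller re-evaluates the pending count on the next cycle before
    adding more, so a transient spike doesn't balloon the AWS bill.
    """
    headroom = max(0, pool_limit - running)
    return min(pending_jobs, step, headroom)
```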
Picture - Ottawa Arboretum - Creative Commons
Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)
https://www.flickr.com/photos/rohit_saxena/4552766281/sizes/l
32. LIMIT EBS USE
• EBS is network-attached storage for the EC2 VM
• It is much cheaper to use the disk that comes with the instance type
33. SUMMARY: AWS
• Golden AMIs regenerated daily
• Use spot instances
• Smarter bidding algorithms
• Optimize use of regions, instance types and capacity
• Limit pool size and increase capacity gradually
• Use instance storage vs EBS to save $
With these changes, we reduced our initial AWS bill by 70% (as of last year). However, today we also use AWS S3 (backend storage), which has significantly increased our bill over our initial implementation (we migrated all of our FTP data to S3).
34. EMULATOR ENVIRONMENT (1)
• Android 4.3 (AOSP 4.3.1_r1, JLS36I); standard 2.6.29 kernel
• 1 GB of memory
• 720×1280, 320 dpi screen
• 128 MB VM heap
• 600 MB /data and 600 MB /sdcard partitions
• front and back emulated cameras; all emulated sensors
• standard crashreporter, logcat, anr, and tombstone support
So now that we've talked about our AWS environment, let's talk about our move to emulators.
From https://gbrownmozilla.wordpress.com/2015/04/23/android-4-3-opt-tests-running-on-trunk-trees/
35. EMULATOR ENVIRONMENT (2)
• Run the emulator that comes with the Android SDK, load the custom image, and install the Firefox apk
• We run tests on a variety of instance types (m3.medium, m3.xlarge, c3.xlarge)
http://developer.android.com/tools/devices/emulator.html
36. This is a screenshot of the emulator starting up. We have tooling in our test suites that creates a screenshot when the emulator starts, or when a test fails. These screenshots, logs and other testing artifacts are uploaded to Amazon S3 storage and are available to developers when their tests fail.
37. This screenshot is of an Android test suite failure.
Most of the time the logs that are uploaded with the screenshot are more useful.
Example log
http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/61c91375333e3265c832cfi6f1fi314fb9b70c6a2d15386f0a303c7226cfd1ed7209680d88ac032332907a43cfcf4f03c5f02e5531101ae3b855c699ce1e4e02
38. ACCESS TO DEVICES
• Access to processes via adb (Android Debug Bridge)
• Allows us to kill errant processes
• Some test types require root permissions to copy files to certain locations or for other privileged operations
http://developer.android.com/tools/help/adb.html
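Killing an errant process via adb typically means parsing `adb shell ps` output for the target process name and feeding the PID to `adb shell kill`. A sketch of the parsing step - the column layout follows the classic Android toolbox `ps` (USER PID PPID ... NAME), though exact columns vary by Android version, and the harness code here is illustrative, not Mozilla's:

```python
def find_errant_pids(ps_output, process_name):
    """Parse `adb shell ps` output and return PIDs whose final column
    (the process name) matches `process_name`.

    The returned PIDs could then be killed with `adb shell kill <pid>`.
    """
    pids = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[-1] == process_name:
            pids.append(int(fields[1]))  # PID is the second column
    return pids
```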
39. MIGRATION PROCESS
• Moved correctness tests, then debug
• Many intermittent issues
• Debug tests were problematic
• They take longer and consume more resources
40. MIGRATION LESSONS
• Use more powerful instance types
• Specify longer timeouts for individual tests
• Skip tests on certain (slow) platforms
• Split the tests into smaller tests
• Optimize or simplify the test
https://gbrownmozilla.wordpress.com/2015/05/26/handling-intermittent-test-timeouts-in-long-running-tests/
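Several of these lessons (longer timeouts, per-platform skips) boil down to applying platform-specific rules to a test manifest before scheduling. A sketch under assumed data shapes - this is not Mozilla's actual manifest format or adjustment code:

```python
def adjust_manifest(tests, platform, rules):
    """Apply per-platform adjustments to a list of tests.

    `rules` maps a platform name to a dict with optional 'skip' (a set
    of test names to drop on that platform) and 'timeout_factor' (a
    multiplier for each remaining test's timeout). Both the manifest
    shape and the rule names are illustrative.
    """
    platform_rules = rules.get(platform, {})
    skip = platform_rules.get("skip", set())
    factor = platform_rules.get("timeout_factor", 1)
    adjusted = []
    for test in tests:
        if test["name"] in skip:
            continue  # lesson: skip known-slow tests on slow platforms
        adjusted.append({"name": test["name"],
                         "timeout": test["timeout"] * factor})
    return adjusted
```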
41. PERFORMANCE TESTS
• Autophone is a Mozilla project measuring page load performance and testing video playback on real Android devices
• It can provision, verify, recover, run tests on, and identify the status of a variety of phones
We retain a small pool of real devices for performance tests.
From https://wiki.mozilla.org/Auto-tools/Projects/Autophone
Verify that a phone is working correctly: sd card is writable and not full, etc.
Attempt to recover a phone that reports errors, rerunning the current test/test framework.
Provide at least a high-level status for all phones: whether they are idle, running a test, or disabled/broken.
Support a large number of phones, potentially split amongst several host machines.
42. EMULATORS IN AWS: THE GOOD
When we want to test a new Android version, we just need a new emulator image, not a new hardware stack. There is no lead time associated with procuring and installing new hardware in the data centre.
Increased reliability due to fewer retries (2% vs 18% on pandas).
Some of that reliability stems from the fact that with emulators, tests run from the same fresh Android image each time. When the tests ran on devices, the reimaging process took a long time, and the devices had to be re-imaged every so often, which was a more manual process.
Scalable to deal with daily job spikes.
We don't have to write and maintain software to manage a pool of devices. We can just use the Amazon APIs to provision resources for our CI system.
Picture by SaturatedEyes - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)
https://www.flickr.com/photos/shuttershuk/7099823113/sizes/l
43. EMULATORS IN AWS: THE BAD
• More tests running in parallel (tests run slower, and we added more tests)
• No performance tests, because we're running emulators on emulators
Tests run slower because we're running tests on emulators on emulators.
More tests need to run in parallel because they take longer. Example: Android 4.3 debug tests need to run about 2x as many jobs as they did when running on raw devices.
No performance tests (we have a separate pool of raw devices for this purpose).
As a side note: Amazon has a new offering from this summer called Device Farm, which allows you to run tests on multiple devices. We don't use it because it is driven through an API that doesn't support the test harnesses we use, and it doesn't allow root access to the device. Also, the pricing ($250 a month for a single dedicated device) is much more expensive than spot instances.
Picture by Tuncay - Creative Commons
Attribution 2.0 Generic (CC BY 2.0) https://www.flickr.com/photos/tuncaycoskun/15809887756/sizes/l
44. SUMMARY: EMULATORS ON AWS
• Determine what testing can be done on an emulator vs a real device
• Use the minimum viable instance type
• Run more tests in parallel
You may need a larger instance type to speed up longer-running tests.
Minimize the number of tests that need to run on real hardware. Running tests on real devices in continuous integration is much more complicated and painful than running them on emulators, and real hardware does not let you upgrade easily for the next Android version.
48. LEARN MORE
• @MozRelEng
• http://planet.mozilla.org/releng/
• Mozilla Releng wiki https://wiki.mozilla.org/ReleaseEngineering
• IRC: channel #releng on moznet
49. MORE READING 1
• Laura's talks on monitoring complex systems http://vimeo.com/album/3108317/video/110088288
• Armen's talk on our hybrid infrastructure https://air.mozilla.org/problems-and-cutting-costs-for-mozillas-hybrid-ec2-in-house-continuous-integration/
• Move to AWS starting in 2012
• http://atlee.ca/blog/posts/blog20121002firefox-builds-in-the-cloud.html
• http://johnnybuild.blogspot.ca/2012/08/migrating-linux32-and-linux64-builds-to.html
• http://atlee.ca/blog/posts/blog20121214behind-the-clouds.html
• http://rail.merail.ca/posts/firefox-unit-tests-on-ubuntu.html
Scaling
http://atlee.ca/blog/posts/bursty-load.html
jacuzzis
http://atlee.ca/blog/posts/initial-jacuzzi-results.html
http://hearsum.ca/blog/experiments-with-smaller-pools-of-build-machines/
Caching