Scaling mobile testing on AWS: Emulators all the way down

SCALING MOBILE TESTING
ON AWS: EMULATORS ALL
THE WAY DOWN
Kim Moir, Mozilla, @kmoir
URES, November 13, 2015
Good morning. My name is Kim Moir and I’m a release engineer at Mozilla. Today I’m going to discuss how we scale our Android testing on AWS. Show of hands -
how many of you test on Android? On a continuous integration farm?

References

Androids by etnyk Attribution-NonCommercial-NoDerivs 2.0 Generic license

https://www.flickr.com/photos/etnyk/5588953445/sizes/l

A little about me. I live in Ottawa, Ontario, Canada. My hobbies include running and making ice cream, which complement each other well. This picture shows a release
engineering ice cream flavour: coffee ice cream with chocolate chip cookies soaked in Kahlúa. Before I was a release engineer at Mozilla, I worked at IBM as a release
engineer on Eclipse. So that's 12 years working on open source release engineering. I’m really excited to be here today to share my stories and learn from all of you.

Here’s a picture of where the amazing Mozilla release engineering team works. As you can see, we are quite distributed across the world, and many of us work
remotely from our homes.

Mozilla is a non-profit. Our mission is to promote openness, innovation & opportunity on the web.

You’re probably familiar with the products we build, such as Firefox for Desktop, Android, iOS and Firefox OS. Firefox for iOS was actually released yesterday - so go
and try it out!

Note that we ship Firefox on four platforms and with ~97 locales on the same day as US English

We have a continuous integration farm running 24x7 on commit. Our release cadence is every six weeks for Firefox for Android. We release betas every week.

https://wiki.mozilla.org/RapidRelease

I’ll talk a little bit about our environment in general, before I delve into our Android test environment.

DAILY
• 350 pushes
• 4700 build jobs
• 150,000 test jobs
Here are some recent numbers on the aggregate jobs we run (all products, not just Firefox for Android). Today, about 66% of build jobs and 80% of test jobs are run on
AWS. We only have our performance tests left that run on raw devices. They can’t run on emulators because performance is not constant.

Each time a developer lands a change, it invokes a series of builds and associated tests on relevant platforms. Within each test job there are many actual test suites that
run.

September:

8188 pushes

https://secure.pub.build.mozilla.org/buildapi/reports/pushes?starttime=1441090800&endtime=1443682800

September jobs

https://secure.pub.build.mozilla.org/buildapi/reports/waittimes?starttime=1441090800&endtime=1443682800

Builds Oct 4-Oct 10: 15560

Builds Tuesday Oct 6: 2814

15 MINUTE SERVICE
We have a commitment to developers that build/test jobs should start within 15 minutes of being requested. We don’t have a perfect record on this, but certainly our
numbers are good. We have metrics that measure this every day so we can see what platforms need additional capacity. And we adjust capacity as needed, and
remove old platforms as they become less relevant in the marketplace.
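
A minimal sketch of how a daily wait-time metric like this could be computed. The tuple layout, platform names and the `wait_time_report` helper are assumptions made for illustration; they are not taken from Mozilla's actual reporting tools.

```python
from collections import defaultdict

WAIT_TARGET_SECONDS = 15 * 60  # the 15-minute commitment described above


def wait_time_report(jobs):
    """Return {platform: fraction of jobs that started within the target}.

    `jobs` is an iterable of (platform, requested_at, started_at) tuples,
    with both times given as Unix timestamps in seconds (assumed layout).
    """
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for platform, requested_at, started_at in jobs:
        totals[platform] += 1
        if started_at - requested_at <= WAIT_TARGET_SECONDS:
            on_time[platform] += 1
    return {platform: on_time[platform] / totals[platform] for platform in totals}


# Hypothetical sample data.
sample_jobs = [
    ("android-4.3", 0, 300),    # waited 5 minutes
    ("android-4.3", 0, 1200),   # waited 20 minutes
    ("linux64", 100, 200),      # waited under 2 minutes
]
report = wait_time_report(sample_jobs)
```

A platform whose fraction keeps dropping is a candidate for additional capacity.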

———

Pizza picture by djwtwo

Attribution-NonCommercial-ShareAlike 2.0 Generic (CC BY-NC-SA 2.0)

https://www.flickr.com/photos/djwtwo/9864611814/sizes/l/

+ many Mozilla tools
Here are some of the projects that we use in our infrastructure.

Buildbot is our continuous integration engine. However, we are in the process of migrating to TaskCluster. TaskCluster is a set of components that manages task
queuing, scheduling, execution and provisioning of resources. It was designed to run automated builds and tests at Mozilla.

We use Puppet for configuration management of all our Buildbot servers, and of the Linux and Mac machines. So when we provision new hardware, we just boot the device
and it puppetizes based on its role, which is defined by its hostname.
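
The idea of deriving a role from a hostname can be sketched as follows. This is only an illustration in Python, not the Puppet configuration itself, and the hostname patterns and role names are hypothetical.

```python
import re

# Hypothetical hostname patterns and role names, for illustration only.
ROLE_PATTERNS = [
    (re.compile(r"^bld-linux64-"), "build"),
    (re.compile(r"^tst-linux64-"), "test"),
    (re.compile(r"^buildbot-master\d+\."), "buildbot_master"),
]


def role_for_hostname(hostname):
    """Return the first role whose pattern matches the hostname, or None."""
    for pattern, role in ROLE_PATTERNS:
        if pattern.match(hostname):
            return role
    return None
```

With a mapping like this, a freshly booted machine only needs a correct hostname to receive the right configuration.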

Our repository of record is hg.mozilla.org, but developers also commit to git repos and these commits are transferred to the hg repository. We also use a lot of Mozilla
tools that allow us to scale. These tools are open source as well, and I have links to these repos at the end of the talk.

——

References

octokitty http://www.flickr.com/photos/tachikoma/2760470578/sizes/l/

DEVICES
• 6700+ in total
• 1900+ for builds
• 4700+ for tests
• 75% AWS
These numbers are for both Android and desktop devices. The pools overlap.

80% test AWS and 66% build AWS

——-

References

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html

* https://secure.pub.build.mozilla.org/slavealloc/ui/#silos

HISTORY OF MOBILE TESTING
AT MOZILLA
Before I talk about where we are today, I’d like to step back and talk about how our mobile testing evolved over the years.

Here’s a picture from 2009 of a mobile pedalboard. This was our first attempt at mobile test automation. It was used to report Fennec performance data on the Nokia
N810s.

Picture by Aki Sasaki

https://www.flickr.com/photos/drkscrtlv/3590117065/sizes/l

Picture by Aki Sasaki

https://www.flickr.com/photos/drkscrtlv/3590924524/sizes/l

http://escapewindow.dreamwidth.org/205930.html

In 2010, we moved on to testing on Android 2.2 on Tegras. Tegras are bare reference boards.

We stored the Tegras in shoe racks from Bed Bath and Beyond.

These shoe racks were stored in a room that was shielded from wireless interference. The shoe racks allowed us to position the phones so they weren’t too close
together, on a material that didn’t get too hot and did not conduct electricity. These racks also allowed us to easily take dead phones out, open, remove batteries,
reimage and replace.

Picture from John O’Duinn’s blog

http://oduinn.com/blog/2010/02/11/unveiling-mozillas-faraday-cage/

http://oduinn.com/images/2013/blog_2013_RelEngAsForceMultiplier.pdf

In 2012, we started running continuous integration tests on Android reference cards in specially designed racks. We started with 800 of them, but only use about 200
today. The cards are called pandas. These were used to run Android 4.0 tests for correctness, debug and performance.

___

References

Pictures of Panda chassis from Dustin’s blog

https://blog.mozilla.org/it/2013/01/04/mozpool/2012-11-09-08-30-03/

They had a custom relay board to allow us to reboot them remotely.



Many racks of pandas

These devices are not as stable as desktop devices, and are prone to failure. Given their numbers, dealing with machines that fail all the time would be very expensive if
they were managed by humans. We wrote some software called mozpool to automatically reimage and reboot them.



WHAT DID WE LEARN?
What did we learn over these iterations of our mobile testing infrastructure?

Each successive mobile testing solution became more reliable (fewer infra failures) and easier to manage via automated tools

Manufacturers EOL reference cards. Old reference cards don’t support new Android versions

Does not scale for peak load

Time consuming and expensive to adjust automation infrastructure for every new hardware iteration

Picture

https://www.flickr.com/photos/wocintechchat/21909333504/sizes/l from

http://www.wocintechchat.com/blog/wocintechphotos #WOCtechchat

Picture: computer history museum

https://www.flickr.com/photos/indigoprime/2239342335/sizes/o/

We have bursty traffic, varying by time of day, time of year, and so on.

Example of the number of jobs running per hour in a typical week

Bursty traffic: you can see that the number of jobs run each day is variable as time zones wake up, and the large trough is the weekend.

BRANCHING
We have many different branches in Hg at Mozilla. Our Hg branches are all named after different tree species

Developers push to different branches depending on their purpose. Different branches have different scheduling priorities within our continuous integration engine. So
for instance, if a change lands on the mozilla-beta branch, the builds and tests associated with that change will have machines allocated to them at a higher priority
than if the change had landed on the cedar branch, which is just for testing purposes.
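
Branch-based scheduling priority can be sketched with a priority queue. This is a simplified illustration, not the scheduler Buildbot actually uses; the numeric priorities, the default value and the job name are assumptions.

```python
import heapq
import itertools

# Lower number means higher priority. The numeric values are assumptions.
BRANCH_PRIORITY = {"mozilla-beta": 0, "cedar": 5}
DEFAULT_PRIORITY = 3


class PendingQueue:
    """Hands out pending jobs by branch priority, then by submission order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves submission order on ties

    def add(self, branch, job_name):
        priority = BRANCH_PRIORITY.get(branch, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (priority, next(self._counter), branch, job_name))

    def pop(self):
        _, _, branch, job_name = heapq.heappop(self._heap)
        return branch, job_name


queue = PendingQueue()
queue.add("cedar", "android-4.3-mochitest")         # submitted first
queue.add("mozilla-beta", "android-4.3-mochitest")  # submitted second
first = queue.pop()
second = queue.pop()
```

Even though the cedar job was requested first, the mozilla-beta job is the one that gets a machine first.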

Picture by Aurelio Asiain

Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)

https://flic.kr/p/v27AD

Source: http://opensignal.com/reports/2015/08/android-fragmentation/
What do we need to test? Here’s a picture of Android device fragmentation as of August 2015

And here is current Android adoption (October 2015)

Android "KitKat" 4.4 has about a 40% adoption rate.

The Android "Jelly Bean" versions (4.1–4.3.1) have a combined share of 30.2%.

Sources

https://en.wikipedia.org/wiki/Android_version_history

ANDROID TEST PLATFORMS
• Android 2.3, 4.0, 4.2 (x86), 4.3
• Test types
• correctness
• debug
• performance
Obviously, we cannot test on all those platforms and devices; it’s not feasible. We limit our testing to the following platforms.

In 2012, we started moving our build and test infrastructure to Amazon. We first implemented this for desktop Firefox jobs on Linux. We then implemented them for
Android.

Scalable infrastructure for bursty traffic with an API to manage it all.

Scalable

Deals with bursty load

APIs!

Picture by Tim Norris

Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)

https://www.flickr.com/photos/tim_norris/2600844073/sizes/o/

AWS TERMINOLOGY
• EC2 - Elastic Compute Cloud - machines as VMs
• EBS - Elastic Block Store - network-attached storage
• Region - separate geographical area
• Availability zone - multiple, isolated locations within a region
I’m going to talk a bit about some AWS terms for those of you that may not be familiar with them.

Notes:

AWS instance types http://aws.amazon.com/ec2/instance-types/

MORE AWS TERMS
• AMI - Amazon Machine Image
• instance type - VM with defined specifications and cost per hour. For example:
- AMIs: Amazon has standard ones that you can modify, or you can create your own

- Pricing on instance types can depend on the region

- m3.medium currently costs around $0.07/hr in most regions (Nov 2015 costs)

- Some instance types may not be available in all availability zones

PUPPET VS AMIS
AMIs are Amazon Machine Images

Golden AMIs

We create golden image AMIs via cron each night. These images are generated from our Puppet configs. We have different images defined for different instance types
and the roles that they perform. For example, test and build instances have different libraries and configuration in Puppet.

Originally we used Puppet to manage all of our build and test instances. It was too slow to puppetize the spot instances.

Solution: Create golden AMIs from configs each night via cron. These are used to instantiate the new spot instances.

We also use the same pool AMI to run Android tests and Linux tests; they just run in different directories. Another reason for nightly regeneration is pre-populating VCS
caches to reduce first-time startup load.
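
When a new spot instance is requested, the most recent golden AMI for its role has to be looked up. A minimal sketch of that lookup, assuming each image is described by a dict with hypothetical `id`, `role` and `created` keys (the real tools query AWS for this information):

```python
from datetime import date


def latest_golden_ami(images, role):
    """Return the id of the newest image for a role, or None if there is none."""
    candidates = [image for image in images if image["role"] == role]
    if not candidates:
        return None
    return max(candidates, key=lambda image: image["created"])["id"]


# Hypothetical image records, one per nightly cron run and role.
images = [
    {"id": "ami-test-old", "role": "test", "created": date(2015, 11, 11)},
    {"id": "ami-test-new", "role": "test", "created": date(2015, 11, 12)},
    {"id": "ami-build-new", "role": "build", "created": date(2015, 11, 12)},
]
test_ami = latest_golden_ami(images, "test")
```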

Picture by shaireproductions

Creative Commons Attribution 2.0 Generic (CC BY 2.0)

https://flic.kr/p/dTfsCs

USE SPOT INSTANCES
• Use spot instances vs on demand instances
• much cheaper
• not instantiated as quickly
• terminated if outbid while running
Amazon has many different types of instances. Initially, we used on-demand instances. They instantiate quickly but cost more per hour than other options.

Spot instances are Amazon's way of auctioning off excess capacity. You can bid for an instance, and if nobody else bids for it at a price above your offer, the spot instance
will be instantiated for you. However, if you’re running a spot instance and someone bids a higher price than you did, your instance can be killed. But that’s okay
because we have configured our build farm to retry jobs that fail, and a very small percentage are killed this way (< 1%).
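
The retry behaviour can be sketched like this. It is a simplified illustration rather than the build farm's real retry logic; the exception class, the `run_job` callable and the attempt limit are all assumptions.

```python
class SpotInstanceTerminated(Exception):
    """Raised by a (hypothetical) job runner when its spot instance is reclaimed."""


def run_with_retry(job, run_job, max_attempts=3):
    """Run a job, retrying if the instance under it is terminated.

    Returns (result, number_of_attempts). Re-raises after the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job(job), attempt
        except SpotInstanceTerminated:
            if attempt == max_attempts:
                raise


# A fake runner that loses its instance on the first attempt only.
attempts_seen = []


def flaky_runner(job):
    attempts_seen.append(job)
    if len(attempts_seen) == 1:
        raise SpotInstanceTerminated(job)
    return "success"


result, attempts = run_with_retry("android-4.3-reftest-7", flaky_runner)
```

Because so few jobs are killed this way, the cost of the occasional rerun is small compared with the savings from spot pricing.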

Since the spot instances aren’t available as quickly as the on-demand instances, some tests don’t start within 15 minutes but that’s okay. Spot instances are
instantiated every time with the AMI you specify.

Other notes

Smart spot bidding library https://bugzilla.mozilla.org/show_bug.cgi?id=972562

Minimum viable instance type

Run more tests in parallel on cheaper instance types rather than upgrading the instance type

Most tests run on m3.medium but some need more

Limit the subset of tests run on more expensive instance types to those that actually need it

Our tests have a timeout for a suite of tests. If they don’t complete within this timeout, they fail and retry.

It’s much cheaper to run more tests in parallel on a cheaper instance type than to run them on a more expensive instance type, due to the scale of our operations. For example,
our Android 4.3 reftests invoke 48 parallel jobs.

For instance, we have Android tests that run on emulators on AWS. Some of the reference tests required a c3.xlarge to run.

The correctness tests were fine to run on m3.medium.
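
A worked example of the arithmetic behind this trade-off. Only the m3.medium price and the 48 parallel jobs come from the notes above; the c3.xlarge price, the total suite duration and the speedup factor are assumptions chosen purely for illustration.

```python
# Price taken from the notes above (Nov 2015, approximate).
M3_MEDIUM_PRICE = 0.07  # USD per instance-hour

# Everything below is an assumption made up for this example.
C3_XLARGE_PRICE = 0.21           # USD per instance-hour (assumed)
SUITE_HOURS_ON_M3_MEDIUM = 24.0  # instance-hours for the whole suite (assumed)
C3_XLARGE_SPEEDUP = 2.0          # how much faster the suite runs on c3.xlarge (assumed)


def suite_cost(instance_hours, price_per_hour):
    """Total cost of a suite that consumes the given number of instance-hours."""
    return instance_hours * price_per_hour


def wall_clock_hours(instance_hours, parallel_jobs):
    """Elapsed time when the work is spread evenly over parallel jobs."""
    return instance_hours / parallel_jobs


cheap_cost = suite_cost(SUITE_HOURS_ON_M3_MEDIUM, M3_MEDIUM_PRICE)
big_cost = suite_cost(SUITE_HOURS_ON_M3_MEDIUM / C3_XLARGE_SPEEDUP, C3_XLARGE_PRICE)
cheap_wall_clock = wall_clock_hours(SUITE_HOURS_ON_M3_MEDIUM, 48)
```

Under these assumptions the larger instance finishes the work in half the instance-hours but at three times the hourly price, so the cheaper type wins on cost, and running 48 jobs in parallel keeps the elapsed time short.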

Picture by kenny magic

Creative Commons Attribution 2.0 Generic (CC BY 2.0)

https://www.flickr.com/photos/kwl/4247555680/sizes/l

WHERE’S THE CODE?
• The tools we use are all open source
• https://github.com/mozilla/build-cloud-tools
• Which use boto libraries (Python interface to AWS) https://github.com/boto/boto
The code we use to interact with AWS APIs resides here.

SMARTER BIDDING
ALGORITHMS
• Important scripts
• aws_stop_idle.py
• aws_watch_pending.py
- aws_stop_idle.py stops instances that are no longer needed given our current capacity (idle for a certain time period; the threshold depends on whether the instance is on-demand or spot)

- aws_watch_pending.py activates instances given the criteria on the next slide

REGIONS AND INSTANCES
• Run instances in multiple regions
• Start instances in cheaper regions first
• Automatically shut down inactive instances
• Start instances that have been recently running
• Bid on similar instance types
If you look at aws_watch_pending.py, these are some of the rules that it implements.

We also use machines in multiple AWS regions, in case one region goes down, and also to save on costs (some regions are cheaper). Currently we only use us-east-1
and us-west-2. Since all of our CI infrastructure resides in California, we don’t use most other regions. Other companies are different and need to have instances available
instantly. For instance, I recently saw a talk by Bridget Kromhout (http://bridgetkromhout.com/speaking/2014/beyondthecode/), an operations engineer from DramaFever.
This company provides international movie content on demand. They use every single AWS region because their customer base is so distributed.

You get better build times and lower costs if you start instances that have recently been running (they still retain artifact dirs, and there are billing advantages).
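
Two of these rules, cheaper regions first and recently running instances first, can be sketched as a sort order. This is not the logic in aws_watch_pending.py itself; the instance records, the `last_stopped` timestamps and the relative prices are hypothetical.

```python
# Hypothetical relative prices; real spot prices change constantly.
REGION_PRICE = {"us-east-1": 0.010, "us-west-2": 0.012}


def order_start_candidates(stopped_instances, region_price):
    """Sort stopped instances: cheapest region first, then most recently stopped."""
    return sorted(
        stopped_instances,
        key=lambda inst: (region_price[inst["region"]], -inst["last_stopped"]),
    )


# `last_stopped` is a Unix timestamp; larger means stopped more recently.
stopped = [
    {"id": "i-aaa", "region": "us-west-2", "last_stopped": 1000},
    {"id": "i-bbb", "region": "us-east-1", "last_stopped": 500},
    {"id": "i-ccc", "region": "us-east-1", "last_stopped": 900},
]
ordered_ids = [inst["id"] for inst in order_start_candidates(stopped, REGION_PRICE)]
```

An instance that stopped recently is more likely to still have warm caches and artifact directories, which is why it is preferred within a region.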

LIMIT POOL SIZE
Limit pool size

The size of the AWS pools allocated to different instance types is limited, so if the number of requests spikes we have higher pending counts, but not a huge spike in our
AWS bill.

The bidding algorithm does not automatically bring up machines for all pending jobs. It adds some capacity, waits, re-evaluates the pending count, and adds some more
if needed.

This is similar to a thermostat system that heats your house: it gradually adds more heat.
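
The gradual, capped ramp-up can be sketched as follows. The step size, pool limit and function names are assumptions; the real algorithm weighs more factors than this.

```python
def instances_to_start(pending_jobs, running, pool_limit, step=10):
    """How many instances to add this round: never more than the step size,
    the number of pending jobs, or the room left in the pool."""
    headroom = max(pool_limit - running, 0)
    return min(pending_jobs, step, headroom)


def simulate(pending_jobs, pool_limit, rounds, step=10):
    """Return the number of running instances after each evaluation round."""
    running = 0
    history = []
    for _ in range(rounds):
        started = instances_to_start(pending_jobs, running, pool_limit, step)
        running += started
        pending_jobs -= started
        history.append(running)
    return history


ramp = simulate(pending_jobs=35, pool_limit=25, rounds=4)
```

In this run, capacity grows in steps and then stops at the pool limit even though jobs are still pending, which is the trade-off described above: longer waits during a spike rather than a spike in the bill.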

Picture - Ottawa Arboretum - Creative Commons

Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)

https://www.flickr.com/photos/rohit_saxena/4552766281/sizes/l

LIMIT EBS USE
• EBS is network-attached storage for the EC2 VM
• Much cheaper to use the disk that comes with the instance type

SUMMARY: AWS
• Golden master of AMIs regenerated daily
• Use spot instances
• Smarter bidding algorithms
• Optimize use of regions, instance type and capacity
• Limit pool size and increase capacity gradually
• Use instance storage vs EBS to save $
With these changes, we reduced our initial AWS bill by 70% (as of last year). However, today we use AWS S3 for backend storage (we migrated all of our FTP data to S3),
so our bill has increased considerably since our initial implementation.

EMULATOR ENVIRONMENT
(1)
• Android 4.3 (AOSP 4.3.1_r1, JLS36I); standard 2.6.29 kernel
• 1 GB of memory
• 720×1280, 320 dpi screen
• 128 MB VM heap
• 600 MB /data and 600 MB /sdcard partitions
• front and back emulated cameras; all emulated sensors
• standard crashreporter, logcat, anr, and tombstone support
So now that we’ve talked about our AWS environment, let’s talk about our move to emulators.

From https://gbrownmozilla.wordpress.com/2015/04/23/android-4-3-opt-tests-running-on-trunk-trees/

EMULATOR ENVIRONMENT
(2)
• Run emulator that comes with Android SDK and
load the custom image, install Firefox apk
• We run tests on a variety of instance types
(m3.medium, m3.xlarge, c3.xlarge)
http://developer.android.com/tools/devices/emulator.html

This is a screenshot of the emulator starting up. We have tooling in our test suites that creates a screenshot when the emulator starts, or when a test fails.
The screenshots, logs and other testing artifacts are uploaded to Amazon S3 storage and are available to developers when their tests fail.

This screenshot is of an Android test suite test failure.

Most of the time the logs that are uploaded with the screenshot are more useful.

Example log

http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/61c91375333e3265c832cff6f1ff314fb9b70c6a2d15386f0a303c7226cfd1ed7209680d88ac032332907a43cfcf4f03c5f02e5531101ae3b855c699ce1e4e02

ACCESS TO DEVICES
• Access to processes via adb (Android debug bridge)
• Allows us to kill errant processes
• Some test types require root permissions to copy files to certain locations or for other privileged operations
http://developer.android.com/tools/help/adb.html
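
A minimal sketch of how a test harness might drive adb for the operations listed above. The helper names and the paths in the usage note are hypothetical, and this is not Mozilla's harness code; the adb subcommands themselves (`shell`, `root`, `push`) and the `-s` serial option are standard.

```python
import subprocess

# The serial adb usually assigns to the first emulator started on a host.
EMULATOR_SERIAL = "emulator-5554"


def adb_command(serial, *args):
    """Build the argument list for an adb call against one specific device."""
    return ["adb", "-s", serial] + list(args)


def kill_process_command(serial, pid):
    """Kill an errant process on the device by process id."""
    return adb_command(serial, "shell", "kill", str(pid))


def restart_as_root_command(serial):
    """Restart the adb daemon on the device with root permissions."""
    return adb_command(serial, "root")


def push_file_command(serial, local_path, device_path):
    """Copy a file from the host to a location on the device."""
    return adb_command(serial, "push", local_path, device_path)


def run(command):
    """Execute a prepared adb command; raises CalledProcessError on failure."""
    subprocess.check_call(command)
```

For example, `run(push_file_command(EMULATOR_SERIAL, "prefs.js", "/data/local/tmp/prefs.js"))` would copy a file onto a running emulator, assuming adb is on the PATH.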

MIGRATION PROCESS
• Moved correctness tests, then debug
• Many intermittent issues
• Debug tests were problematic
• Take longer and consume more resources

MIGRATION LESSONS
• Use more powerful instance types
• Specify timeouts that are longer for individual tests
• Skip tests on certain (slow) platforms
• Split the tests into smaller tests
• Optimize or simplify the test
https://gbrownmozilla.wordpress.com/2015/05/26/handling-intermittent-test-timeouts-in-long-running-tests/
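
Splitting a long-running suite into smaller jobs, one of the lessons above, can be sketched as a round-robin chunking function. The function name and test names are hypothetical, and this is not the chunking code used by the Mozilla harnesses.

```python
def split_into_chunks(tests, total_chunks):
    """Distribute tests round-robin so each chunk gets a similar share."""
    return [tests[index::total_chunks] for index in range(total_chunks)]


# Hypothetical test names.
tests = ["test_a", "test_b", "test_c", "test_d", "test_e"]
chunks = split_into_chunks(tests, 2)
```

Each chunk can then run as its own job with its own timeout, so one slow test only causes a retry of its chunk rather than of the whole suite.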

PERFORMANCE TESTS
• Autophone is a Mozilla project measuring page load performance and testing video playback on real Android devices
• Provision, verify, recover, run tests and identify the status of a variety of phones
Retain small pool of real devices for performance tests

From https://wiki.mozilla.org/Auto-tools/Projects/Autophone

Verify that a phone is working correctly: sd card is writable and not full, etc.

Attempt to recover a phone that reports errors, rerunning the current test/test framework.

Provide at least a high-level status for all phones: whether they are idle, running a test, or disabled/broken.

Support a large number of phones, potentially split amongst several host machines.

EMULATORS IN AWS: THE GOOD
Emulators: the good

When we want to test a new Android version, we just need a new emulator image, not a new hardware stack. No lead time associated with procuring and installing new
hardware in the data centre.

Increased reliability due to fewer retries (2% vs 18% on Pandas)

Some of that reliability stems from the fact that with the emulator, tests run from the same, fresh Android image each time. When the tests ran on devices, the
reimaging process took a long time, and the devices had to be reimaged every so often, which was a more manual process.

Scalable to deal with daily job spikes

We don’t have to write and maintain software to manage a pool of devices. We can just use the Amazon APIs to provision resources for our CI system.

Picture by SaturatedEyes - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)

https://www.flickr.com/photos/shuttershuk/7099823113/sizes/l

EMULATORS IN AWS: THE BAD
• More tests running in parallel (tests run slower,
added more tests)
• No performance tests because we’re running
emulators on emulators
Emulators: the bad

Tests run slower because we’re running tests on emulators on emulators

More tests need to run in parallel because they take longer

Example: Android 4.3 debug tests need to run about 2x as many jobs as they did when running on raw devices

No performance tests (have a separate pool of raw devices for this purpose)

As a side note: Amazon has a new offering from this summer called Device Farm, which allows you to run tests on multiple devices. We don’t use it because it is
accessed through an API that doesn’t support the test harnesses that we use. Also, it doesn’t allow root access to the device. Finally, the pricing ($250 a month for a
single dedicated device) is much more expensive than spot instances.

Picture by Tuncay - Creative Commons

Attribution 2.0 Generic (CC BY 2.0) https://www.flickr.com/photos/tuncaycoskun/15809887756/sizes/l

SUMMARY: EMULATORS ON
AWS
• Determine what testing can be done on emulator
vs real device
• Use minimum viable instance type
• Run more tests in parallel
May need larger instance type to speed up longer running tests

Minimize the number of tests that need to run on real hardware. Running tests on real devices in continuous integration is much more complicated and painful than running
them on emulators. Real hardware also does not allow you to upgrade easily for the next Android version.

FUTURE WORK
• Android 5.0 on emulator
• Make it better

WHERE’S THE CODE?
• Cloud tools: https://github.com/mozilla/build-cloud-tools
• buildbot configs https://github.com/mozilla/build-buildbot-configs
• buildbotcustom https://github.com/mozilla/build-buildbotcustom
• Mozharness https://github.com/mozilla/build-mozharness
• Mozpool https://github.com/mozilla/mozpool
• Puppet configs https://github.com/mozilla/build-puppet

LEARN MORE
• @MozRelEng
• http://planet.mozilla.org/releng/
• Mozilla Releng wiki https://wiki.mozilla.org/ReleaseEngineering
• IRC: channel #releng on moznet

MORE READING 1
• Laura's talks on monitoring complex systems http://vimeo.com/album/3108317/video/110088288
• Armen’s talk on our hybrid infrastructure https://air.mozilla.org/problems-and-cutting-costs-for-mozillas-hybrid-ec2-in-house-continuous-integration/
• Move to AWS starting in 2012
• http://atlee.ca/blog/posts/blog20121002firefox-builds-in-the-cloud.html
• http://johnnybuild.blogspot.ca/2012/08/migrating-linux32-and-linux64-builds-to.html
• http://atlee.ca/blog/posts/blog20121214behind-the-clouds.html
• http://rail.merail.ca/posts/firefox-unit-tests-on-ubuntu.html

MORE READING 2
• AWS spot instances vs reserved instances
• http://atlee.ca/blog/posts/now-using-aws-spot-instances.html
• http://rail.merail.ca/posts/firefox-builds-are-way-cheaper-now.html
• http://rail.merail.ca/posts/ec2-spot-instances-experiments.html
• http://taras.glek.net/blog/2014/05/09/how-amazon-ec2-got-15x-cheaper-in-6-months/
• http://taras.glek.net/blog/2014/03/05/more-and-faster-c-i-for-less-on-aws/
• AWS networking
• http://atlee.ca/blog/posts/aws-networks-and-burning-trees.html
• http://rail.merail.ca/posts/using-dns-to-query-aws.html

MORE READING 3
• Scaling
• http://atlee.ca/blog/posts/bursty-load.html
• jacuzzis
• http://atlee.ca/blog/posts/initial-jacuzzi-results.html
• http://hearsum.ca/blog/experiments-with-smaller-pools-of-build-machines/
• Caching
• http://atlee.ca/blog/posts/cache-em-all.html
• Geoffrey Brown’s blog on Android tests https://gbrownmozilla.wordpress.com/
