HOW WE’VE BUILT
YAHOO FANTASY FOOTBALL
Alex Florescu
Yahoo UK
#droidconit
April 10th, 2015
OVERVIEW
Intro
Principles & practices
Testing
Internationalisation
Instrumentation & A/B testing
Performance
INTRO
London team started in January 2014
Fantasy Football (Fantasy Calcio) launched in July 2014
Android / iOS / Web clients + back-end team
THE APP
100k+ MAUs (on Android), ★★★★☆
Premier League, Campionato Italiano, Ligue 1©, Bundesliga, La Liga, MLS
KEY PRINCIPLES
Automate everything
Short release cycle
Performance, stability, quick changes
Track and measure everything
Data-driven product decisions
Stress and enforce principles, not process
ENGINEERING PRACTICES - CI
CI pipeline from day one
CD up to internal deployment
Unit testing & UI testing
Automatic APK generation and signing
Compile-time configs for dev, dogfood & production builds
ENGINEERING PRACTICES - CI
Git flow: Work on a branch, do a pull request to merge
Short lived branches, keep PRs brief
Master always builds, always shippable
All code must be reviewed
Compile-time feature toggles “disable” code that is not ready
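As a rough illustration of such a compile-time toggle (not the actual Yahoo code: the flag, screen and fragment below are hypothetical, and the boolean is assumed to be generated per build variant via Gradle's buildConfigField):

```java
import android.app.Activity;
import android.os.Bundle;

// Sketch only: FEATURE_LIVE_SCORES is assumed to be generated per build variant,
// e.g. buildConfigField "boolean", "FEATURE_LIVE_SCORES", "true" for dev/dogfood
// and "false" for production. LeagueActivity and LiveScoresFragment are hypothetical.
public class LeagueActivity extends Activity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_league);

        if (BuildConfig.FEATURE_LIVE_SCORES) {
            // Unfinished feature stays unreachable in production builds.
            getFragmentManager().beginTransaction()
                    .replace(R.id.live_scores_container, new LiveScoresFragment())
                    .commit();
        }
    }
}
```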
TESTING
CI without automated testing is …
Different levels of testing
On commit hook: robolectric suite
Next stage, smoke suite of UI tests
Nightly: full suite of UI tests, performance tests, monkey tests
ROBOLECTRIC TESTING
Robolectric tests run on JVM, no devices needed
Slower than plain JUnit tests, but significantly faster than UI tests
Very useful as unit tests
With architectures such as MVP, can also be acceptance tests
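A minimal sketch of what such a Robolectric test can look like (assuming Robolectric's standard JUnit runner; TeamActivity, the view ID and the expected text are hypothetical, not taken from the actual suite):

```java
import android.widget.TextView;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.robolectric.Robolectric;
import org.robolectric.RobolectricTestRunner;

import static org.junit.Assert.assertEquals;

// Runs on the JVM via Robolectric's shadows, so no emulator or device is needed.
@RunWith(RobolectricTestRunner.class)
public class TeamActivityTest {

    @Test
    public void showsTeamNameAfterCreate() {
        // TeamActivity, R.id.team_name and the expected text are hypothetical.
        TeamActivity activity = Robolectric.buildActivity(TeamActivity.class)
                .create().start().resume().get();

        TextView teamName = (TextView) activity.findViewById(R.id.team_name);
        assertEquals("My Team", teamName.getText().toString());
    }
}
```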
ROBOLECTRIC PROBLEMS
Not all Android framework functionality is replicated
Differences between JVM and DalvikVM
Difficult to test complex user flows over multiple screens
Custom views sometimes problematic
OUR NUMBERS
700+ tests
50-60% coverage (higher in biz logic, lower in UI)
2 min to run, 6 min for a full build from scratch
UI TESTS
Good:
Proper integration tests
Run on device
Most closely resembling real user flows
Can catch device specific issues
UI TESTS
Bad:
Synchronisation problems (e.g. Button “OK” not found)
Brittle, hard to maintain
Very slow to run
Requires a device lab to be set up for CI
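For reference, a sketch of an on-device UI test in the Espresso style (activity, view IDs and flow are hypothetical); Espresso's idling support softens, but does not eliminate, the synchronisation problems above:

```java
import android.support.test.rule.ActivityTestRule;
import android.support.test.runner.AndroidJUnit4;

import org.junit.Rule;
import org.junit.Test;
import org.junit.runner.RunWith;

import static android.support.test.espresso.Espresso.onView;
import static android.support.test.espresso.action.ViewActions.click;
import static android.support.test.espresso.assertion.ViewAssertions.matches;
import static android.support.test.espresso.matcher.ViewMatchers.isDisplayed;
import static android.support.test.espresso.matcher.ViewMatchers.withId;
import static android.support.test.espresso.matcher.ViewMatchers.withText;

// Drives the real app on a device/emulator; HomeActivity and the view IDs are hypothetical.
@RunWith(AndroidJUnit4.class)
public class JoinLeagueTest {

    @Rule
    public ActivityTestRule<HomeActivity> activityRule =
            new ActivityTestRule<HomeActivity>(HomeActivity.class);

    @Test
    public void joiningALeagueShowsItsName() {
        onView(withId(R.id.join_league_button)).perform(click());
        onView(withText("OK")).perform(click());
        onView(withId(R.id.league_name)).check(matches(isDisplayed()));
    }
}
```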
SMOKE SUITE VS FULL SUITE
Even small suites can take hours to run because of sync issues
For sanity checking, a smoke suite will do
Relatively fast (10-15 min) & simple UI tests
Ensure the app runs and all screens can be reached
FULL SUITE
For enhanced testing, a nightly full suite
In-depth user flow tests, can run for hours
Make sure someone checks it daily!
Should be a release blocker
CI PIPELINE
MONKEY TESTING
Useful for stability testing
Catches crashes and memory leaks
Could be included in automated nightly runs
Make sure app activity is restricted
Lock monkey in app (e.g. Surelock)
Consider removing certain features when monkey runs
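Besides a launcher lock such as Surelock, a cheap way to restrict what the monkey can do is to guard risky actions with the framework's monkey check; a sketch with hypothetical names:

```java
import android.app.ActivityManager;

// Sketch: guard actions that shouldn't fire during monkey runs.
// InviteButtonHandler and InviteSender are hypothetical names.
public class InviteButtonHandler {

    private final InviteSender inviteSender;

    public InviteButtonHandler(InviteSender inviteSender) {
        this.inviteSender = inviteSender;
    }

    public void onInviteFriendsClicked() {
        if (ActivityManager.isUserAMonkey()) {
            // Don't let the monkey spam real people or escape into other apps.
            return;
        }
        inviteSender.send();
    }

    public interface InviteSender {
        void send();
    }
}
```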
TRACKING TESTING
Coverage is useful for analysis (e.g. what areas get the least testing and why?), but don't enforce a coverage target
Reasonable to expect acceptance tests with features
Enforce testing through code review
Tests are code!
Refactoring, good architecture and documentation still apply
I18N, L10N …
Translation: strings only
Localisation: adapting content for language, culture and region
Internationalisation: designing a product to allow localisation
CALCIO, SOCCER, FUßBALL…
We shipped to 20+ locales from day one
Challenges:
All strings need to be translated
Number formatting, currency formatting etc. (see the sketch after this list)
Support, reviews, release notes
Testing load increased — UI issues with some locales only
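The number and currency formatting challenge above is easiest to handle by leaning on the platform's locale-aware formatters; a minimal sketch (class and method names are ours, not from the talk):

```java
import java.text.NumberFormat;
import java.util.Locale;

// Sketch: rely on locale-aware formatters instead of hand-built strings.
public final class Formats {

    private Formats() {}

    public static String points(double points, Locale locale) {
        // e.g. 1234.5 -> "1.234,5" for it_IT, "1,234.5" for en_US
        return NumberFormat.getNumberInstance(locale).format(points);
    }

    public static String prizeMoney(double amount, Locale locale) {
        // Currency symbol and placement vary by locale too.
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }
}
```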
I18N — DEALING WITH IT
Externalise all strings and enforce no lint errors on build
Collect all strings early for translation before they block release
Have standard release notes saved & translated for emergencies
Some test devices permanently on tricky locales
I13N — INSTRUMENTATION
What
Collecting data to understand how an app performs and how it is used
Why
Key to understanding what the users are doing
WHAT TO INSTRUMENT
Time spent in app
Buttons tapped
Loading time, network performance
Anything you want!
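A sketch of how taps and loading times might be reported through a thin tracking facade (the Tracker interface, the presenter and all event names are hypothetical illustrations, not the actual Yahoo instrumentation):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a thin tracking facade; the real implementation would delegate
// to whatever analytics SDK is in use.
public class MatchScreenPresenter {

    public interface Tracker {
        void logEvent(String name, Map<String, String> params);
    }

    private final Tracker tracker;
    private long loadStartMs;

    public MatchScreenPresenter(Tracker tracker) {
        this.tracker = tracker;
    }

    public void onLoadStarted() {
        loadStartMs = System.currentTimeMillis();
    }

    public void onLoadFinished() {
        Map<String, String> params = new HashMap<String, String>();
        params.put("duration_ms", String.valueOf(System.currentTimeMillis() - loadStartMs));
        tracker.logEvent("match_screen_loaded", params);
    }

    public void onShareTapped() {
        tracker.logEvent("share_tapped", new HashMap<String, String>());
    }
}
```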
WHAT TO DO WITH DATA
How long does it take a user to create a team?
What are the best triggers for a user to sign in?
How often do users share something with friends?
Signs of frustration: e.g. repeating identical action
I13N CHALLENGES
Collecting the data is the easy part (and it’s not easy)
Don’t reinvent the wheel, use 3rd party tools for this
We use Flurry
Real challenge:
What does user engagement mean? How do you measure it?
A/B TESTING — WHY?
What makes users more likely to invite or share with friends?
What makes users more likely to be engaged? Happy?
What features do we add or remove?
Is a new feature supporting our high level goals?
Goal: maximum user satisfaction and engagement with the minimum number of features
EXPERIMENTS
Build an MVP of your new feature
Enable the feature in a test bucket (e.g. only for 10% of users; see the bucketing sketch below)
Data is collected for all users, bucket-aware, and results are compared across the test and control buckets
Results can be used to guide product decisions
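A minimal sketch of the bucketing step referenced above (the hashing scheme and names are illustrative assumptions, not the production implementation):

```java
// Sketch: deterministic bucketing so a given user always lands in the same bucket.
public final class Buckets {

    private Buckets() {}

    public static boolean isInTestBucket(String userId, String experimentName, int testPercent) {
        // Mix in the experiment name so different experiments get independent splits.
        int bucket = Math.abs((userId + ":" + experimentName).hashCode() % 100);
        return bucket < testPercent;
    }
}

// Usage (hypothetical): only 10% of users see the new share prompt, and every
// tracked event carries the bucket so test and control can be compared.
//   boolean showSharePrompt = Buckets.isInTestBucket(userId, "league_share_prompt", 10);
```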
EXPERIMENT EXAMPLE
Hypothesis: A prompt to share the newly created league will increase the number of shares
EXPERIMENT RESULTS
Successful!
71% of users who see the prompt share the league
EXPERIMENT EXAMPLE
Hypothesis: A tutorial will increase the number of completed teams
EXPERIMENT RESULTS
Team completion rate was actually unaffected: hypothesis rejected
But users who see the tutorial are significantly more likely to complete the team in the same session
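For context on "significantly more likely": one common way to sanity-check whether a difference between buckets is statistically meaningful is a two-proportion z-test. This is a general-purpose sketch, not the tooling behind these particular numbers:

```java
// Sketch: two-proportion z-test for the difference between test and control conversion rates.
public final class AbStats {

    private AbStats() {}

    /** z statistic for the difference between two conversion rates. */
    public static double twoProportionZ(int conversionsTest, int usersTest,
                                        int conversionsControl, int usersControl) {
        double pTest = (double) conversionsTest / usersTest;
        double pControl = (double) conversionsControl / usersControl;
        // Pooled rate under the null hypothesis that both buckets convert equally.
        double pooled = (double) (conversionsTest + conversionsControl)
                / (usersTest + usersControl);
        double standardError = Math.sqrt(
                pooled * (1 - pooled) * (1.0 / usersTest + 1.0 / usersControl));
        return (pTest - pControl) / standardError;
    }
}

// |z| above roughly 1.96 corresponds to p < 0.05 for a two-sided test.
```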
EXPERIMENTS
“Guesses” are not necessarily right
“Obvious” improvements may not be
Used correctly, real world data provides proof
PERFORMANCE
Caring is measuring
What numbers we track
Cold start time (see the sketch after this list)
FPS
Automated measurements (e.g. nightly build to track progress)
Track production numbers — this is what matters
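One way to approximate the cold start time tracked above is to timestamp Application.onCreate and report once the first screen has drawn; a rough sketch with hypothetical class names (FPS tracking, e.g. via Choreographer, is omitted here):

```java
import android.app.Application;
import android.os.SystemClock;

// Sketch: FantasyApplication is hypothetical; in production the value would be
// sent through the tracking pipeline rather than logged.
public class FantasyApplication extends Application {

    public static volatile long coldStartBeginMs;

    @Override
    public void onCreate() {
        super.onCreate();
        coldStartBeginMs = SystemClock.elapsedRealtime();
    }
}

// In the launch activity's onCreate, after setContentView:
//   getWindow().getDecorView().post(new Runnable() {
//       @Override
//       public void run() {
//           long coldStartMs = SystemClock.elapsedRealtime()
//                   - FantasyApplication.coldStartBeginMs;
//           android.util.Log.d("Perf", "Approx cold start: " + coldStartMs + "ms");
//       }
//   });
```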
PERFORMANCE
Numbers will vary wildly in different regions
Slower networks, older devices
When we started monitoring, our world average for load time was ~2-3x our US/UK one
PERFORMANCE
WRAP-UP
CI & automated testing are key for quality and stability
Instrument everything, use data to experiment and guide product
A/B testing can confirm product hypotheses
You should localise your apps, but know what you’re getting into
Performance needs prod monitoring and on-going measurement
Q & A
yahoo-mep.tumblr.com
www.florescu.org
@flor3scu
