Reliability Patterns
for Large-scale
Selenium Tests
Waseem Hamshawi
Senior Automation Engineer, LivePerson
waseemh@liveperson.com
github.com/waseemh
linkedin.com/in/waseem-hamshawi
Liveperson transforms the connection between
brands and consumers
200BN
API calls
per month
7
Data
Centers
(worldwide)
6000
Virtual
Servers
3BN
Visits
Per month
25
Scenarios
350,000
Executions
per day
7
Data
Centers
(Geo-distributed)
70
Virtual
Machines
SYML
LO
AMS
VA
CA
Distributed Testing
Scalability vs Reliability
● Scale goes up → More deployments
● Minimize risk when modifying tests codebase
● Apply software deployment principals to test automation
Reliable Deployments
Pre-Staging Staging Production
(SSOT)
Failure rate > 3 % Failure rate > 3 %
Failure rate > 3 %
Rollout Rollout
● Multi-tier phased deployment
● Rollout criteria: failure rate < x% for [0,t] time range
● Feature Flags
● Rule of Thumb: External access points and APIs will fail
● Test code should be tolerance to external failures
● Provide Fallbacks to every access point
getRemoteWebDriver()
Primary
Grid
Secondary
Grid
getFallback()
If primary fail
Single Point of Failures
● Selenium Grid - Single point of failure
Secondary Hub/Grid failover
Single Point of Failures
getRemoteWebDriver()
Grid 1
Grid 2 new RemoteWebDriver()
getFallback()
If primary or secondary fail
Grid Load Balancing
Grid 3
Grid N
.
.
Load Balancer
● Breakdown large Grids into smaller ones
● Redundant, highly available Grid setup
● Monitor and load balance
● Open source LB: Ribbon (github.com/Netflix/ribbon)
Circuit Breakers
● Don’t beat a dead horse
● Prevent unnecessary timed-out requests
● Shed load on external access point
● Failure Handling Open source libraries:
○ Hystrix (github.com/Netflix/Hystrix)
○ Failsafe (github.com/jhalterman/failsafe)
Single Point of Failures
OPENCLOSED
HALF OPEN
fail [threshold reached]
Open Circuit timeout
reached
fail
success
success
Fallback
Hystrix Command - Example
Test
Test
Cluster-wide Distribution
VM
VM
VM
VM
Z1-us_east
VM
VM
VM
VM
Z2-eu_west
VM
VM
VM
VM
Z3-asia
VM
VM
VM
VM
Z4-us_west
VM
VM
Z5-eu_east
VM
VMTest
Test
Test
Test
Test
Test
Test
Cluster-wide Distribution
● Launch short-lived Selenium Standalone Server containers
Test
initRemoteWebdriverSession(caps)
Selenium
Container
VM
Selenium
Container
VM
Selenium
Container
VM
Selenium
Container
VM
new RemoteWebDriver(External Endpoint)
Cluster
Manager
● Deploy Selenium containers on-demand
● Let Cluster Manager do the distribution
● Scale is dependent only on cluster’s capacity
● Initialize RemoteWebDriver based on deployment endpoint (IP:Port)
A library for creating short-lived testing endpoints over container clusters
● Tests Integration - Programmatically launch test environment
● Plugable Cluster Management Systems
● Cluster-based Scaling
Ephemerals
Test
Container
VM
Cluster
Manager
Ephemerals
API
Container
VM
Container
VM
Container
VM
Deploy
External Endpointnew Object(External Endpoint)
LivePersonInc/ephem
erals
Container Orchestration using Google Container Engine (GKE) and Kubernetes
● Healthchecks (Liveness/Readiness Probes)
● Self-healing mechanisms
● Autoscaling
● Resource Monitoring
Use Case: GCP, Kubernetes and Docker
Test
Ephemerals
API
Deploy
External IP:4444RemoteWebDriver(External IP:4444)
getRemoteWebDriver
(capabilities)
@Rule
public EphemeralResource<RemoteWebDriver> seleniumResource =
new EphemeralResource(
new SeleniumEphemeral.Builder(deploymentContext)
.withDesiredCapabilities(FIREFOX_42)
.build());
@Test
public void test() {
RemoteWebDriver remoteWebDriver = seleniumResource.get();
.....
}
Ephemerals - JUnit Integration
DEMO
github.com/LivePersonInc/ephemerals
● Deploy safely, Just like your software
● No Assumptions - Failure Awareness
● Avoid or protect SPOFs
● Containerization - Short lived, throwaway instances
● Cluster-based Orchestration
● Cloud is the Future - also for Testing
Takeaways

Reliability Patterns for Large-Scale Automated Tests

  • 1.
  • 2.
    Waseem Hamshawi Senior AutomationEngineer, LivePerson waseemh@liveperson.com github.com/waseemh linkedin.com/in/waseem-hamshawi
  • 3.
    Liveperson transforms theconnection between brands and consumers 200BN API calls per month 7 Data Centers (worldwide) 6000 Virtual Servers 3BN Visits Per month
  • 4.
  • 5.
  • 6.
    ● Scale goesup → More deployments ● Minimize risk when modifying tests codebase ● Apply software deployment principals to test automation Reliable Deployments Pre-Staging Staging Production (SSOT) Failure rate > 3 % Failure rate > 3 % Failure rate > 3 % Rollout Rollout ● Multi-tier phased deployment ● Rollout criteria: failure rate < x% for [0,t] time range ● Feature Flags
  • 7.
    ● Rule ofThumb: External access points and APIs will fail ● Test code should be tolerance to external failures ● Provide Fallbacks to every access point getRemoteWebDriver() Primary Grid Secondary Grid getFallback() If primary fail Single Point of Failures ● Selenium Grid - Single point of failure Secondary Hub/Grid failover
  • 8.
    Single Point ofFailures getRemoteWebDriver() Grid 1 Grid 2 new RemoteWebDriver() getFallback() If primary or secondary fail Grid Load Balancing Grid 3 Grid N . . Load Balancer ● Breakdown large Grids into smaller ones ● Redundant, highly available Grid setup ● Monitor and load balance ● Open source LB: Ribbon (github.com/Netflix/ribbon)
  • 9.
    Circuit Breakers ● Don’tbeat a dead horse ● Prevent unnecessary timed-out requests ● Shed load on external access point ● Failure Handling Open source libraries: ○ Hystrix (github.com/Netflix/Hystrix) ○ Failsafe (github.com/jhalterman/failsafe) Single Point of Failures OPENCLOSED HALF OPEN fail [threshold reached] Open Circuit timeout reached fail success success Fallback
  • 10.
  • 11.
  • 12.
    Cluster-wide Distribution ● Launchshort-lived Selenium Standalone Server containers Test initRemoteWebdriverSession(caps) Selenium Container VM Selenium Container VM Selenium Container VM Selenium Container VM new RemoteWebDriver(External Endpoint) Cluster Manager ● Deploy Selenium containers on-demand ● Let Cluster Manager do the distribution ● Scale is dependent only on cluster’s capacity ● Initialize RemoteWebDriver based on deployment endpoint (IP:Port)
  • 13.
    A library forcreating short-lived testing endpoints over container clusters ● Tests Integration - Programmatically launch test environment ● Plugable Cluster Management Systems ● Cluster-based Scaling Ephemerals Test Container VM Cluster Manager Ephemerals API Container VM Container VM Container VM Deploy External Endpointnew Object(External Endpoint) LivePersonInc/ephem erals
  • 14.
    Container Orchestration usingGoogle Container Engine (GKE) and Kubernetes ● Healthchecks (Liveness/Readiness Probes) ● Self-healing mechanisms ● Autoscaling ● Resource Monitoring Use Case: GCP, Kubernetes and Docker Test Ephemerals API Deploy External IP:4444RemoteWebDriver(External IP:4444) getRemoteWebDriver (capabilities)
  • 15.
    @Rule public EphemeralResource<RemoteWebDriver> seleniumResource= new EphemeralResource( new SeleniumEphemeral.Builder(deploymentContext) .withDesiredCapabilities(FIREFOX_42) .build()); @Test public void test() { RemoteWebDriver remoteWebDriver = seleniumResource.get(); ..... } Ephemerals - JUnit Integration
  • 16.
  • 17.
  • 18.
    ● Deploy safely,Just like your software ● No Assumptions - Failure Awareness ● Avoid or protect SPOFs ● Containerization - Short lived, throwaway instances ● Cluster-based Orchestration ● Cloud is the Future - also for Testing Takeaways