Betfair's Site Rebuild: Fast - We promise


In June 2011 Betfair published a customer commitment to ensure greater transparency and clarity on key aspects of our service, including performance and reliability.

This is our journey so far.

Published in: Technology, Design
  • Introduction: Tim Morrow, Head of Channels at Betfair. In June of this year Betfair published a customer commitment to ensure greater transparency and clarity on key aspects of our service, including performance and reliability. We published this despite the fact that our existing website was failing to live up to those promises. We had just embarked on a site rebuild to deliver on them. I’m here to tell you about our journey so far.
  • What do we do? Betfair is a betting company. We’re one of the world's largest international online sports betting providers. We have offices in the UK, Ireland, US, Romania, Australia, Malta and Gibraltar. Whether your sport is Football, Tennis, Horse Racing, Greyhounds, Cricket, Cycling, Darts, Bowls, Yachting or Netball... you can find a market to bet on.
  • We’re a little different from most other betting companies. We pioneered the first successful betting exchange in 2000. Our betting exchange allows customers to bet for an outcome (backing it) or bet against an outcome (laying it). You choose your own odds and enter your stake. Betfair will match it against other customers’ odds and stakes, or offer it to be matched. Our platform permits trading, whereby customers place bets and then profit by closing out the bet at a later stage at more favorable odds. This is especially compelling when betting during a horse race or event, as prices are changing 10 times per second.
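To make the idea of closing out a trade concrete, here is a small worked sketch. It assumes decimal odds and ignores commission; the function names and the figures in the usage note are hypothetical illustrations, not part of Betfair's platform.

```python
def hedge_lay_stake(back_stake: float, back_odds: float, lay_odds: float) -> float:
    """Lay stake that gives the same profit whichever way the event goes.

    Assumes decimal odds and no commission.
    """
    return back_stake * back_odds / lay_odds


def trade_profit(back_stake: float, back_odds: float, lay_odds: float) -> tuple[float, float]:
    """Return (profit if the selection wins, profit if it loses) after closing out."""
    lay_stake = hedge_lay_stake(back_stake, back_odds, lay_odds)
    # Selection wins: the back bet pays out, the lay bet's liability is lost.
    if_wins = back_stake * (back_odds - 1) - lay_stake * (lay_odds - 1)
    # Selection loses: the back stake is lost, the lay stake is kept.
    if_loses = lay_stake - back_stake
    return if_wins, if_loses
```

For example, backing £10 at odds of 3.0 and later laying at 2.0 calls for a £15 lay stake and locks in £5 on either outcome.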
  • We have a number of channels for placing bets: a desktop website and optimized websites for various mobile devices; native iPhone, iPad and Android applications; betting from TV. We also have a developer API with an ecosystem of 3rd-party products built on it.
  • Some facts and figures, across all our products. We have over 3.7M registered customers. They use our products from over 140 countries and territories. Our web product is translated into 17 languages. We have over £300M of customers’ funds on deposit. Last year, £68M was traded on the 2010 World Cup Final; that’s a single match. During the Grand National, 30,000 transactions were placed in a single minute.
  • Betfair has enjoyed rapid growth since it was launched in 2000, expanding its product features, customer base and geographic reach. Regulatory compliance has added complexity. These complexities lead to risks during software deployments that can cause downtime and outages. But it’s simpler than that: we haven’t invested in website performance. We provide a number of avenues for customers to give us feedback, whether through our community forums or feedback links. Our customers are happy to give us their opinions: “during the world cup the site has run like treacle”; “The site is frequently slow and freezes too often.”; “very slow and glitchy”; “I find Betfair slow in refresh and placing bets”.
  • In June 2011 we published our customer commitment to ensure greater transparency and clarity on key aspects of our service. It covers five major topics. The fifth relates to Our Product & Technology. We committed to maintain targets around the quality and reliability of our products and services: a target uptime of 99.9%, and a target full page load time of 3 seconds for 95 percent of customers with average bandwidth.
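As a rough illustration of what a 99.9% uptime target allows, the arithmetic can be sketched as follows. The 30-day period and the function name are assumptions made for the example; the commitment itself does not state the measurement window here.

```python
def downtime_budget_minutes(uptime_target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime target over a period.

    uptime_target is a fraction, e.g. 0.999 for 99.9%.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - uptime_target)
```

Under these assumptions, `downtime_budget_minutes(0.999)` comes to roughly 43 minutes over a 30-day month.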
  • We started publishing our monthly progress. This makes me cringe: a 14s full page load for a landing page. Those wild see-saws are actually related to partial failures on the page: assets timing out. We’re measuring this via Keynote: using Internet Explorer 8; from locations that represent 70% of our customer base; over last-mile DSL connections; measuring the download and execution time of every asset.
  • We wanted to go back to basics. Much like Maslow’s hierarchy of human needs, we defined our website’s Hierarchy of Needs. At the bottom is the most basic level of needs: necessities. These must be met before we can pursue higher-level needs. It works: our site is available. Customers can log in and place bets. No “closed” signs on the front door. We are open for business 24/7. It’s fast: this is a basic need. The other stuff doesn’t matter if your customers are giving up in frustration. It’s useful: OK, so now we can start optimizing for utility: shortening user journeys, optimizing the checkout process, improving the search feature. It’s cool: only then should you do the cool stuff, the flash, the confectionery.
  • We defined our plan. It was to deliver incrementally. We had to show progress. We opted to build a vertical slice of functionality: a fully functioning experience for one of the most popular sports, Football. Customers would be able to log in and bet. April: proof of concept. August: we made it publicly accessible. September: we invited customers to use the new site and gathered feedback. October: we began to divert customers onto the new site. We built a “throttle” that allows us to control how many customers we direct to the new experience. We plan to expand the product: translated into all 17 languages and offering the same breadth of sports. Finally, we can begin to deliver new features.
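The talk does not say how the throttle works. One common way to build such a control, sketched here purely as an assumption, is to hash a stable customer identifier into a bucket and compare it with the current rollout percentage, so that each customer gets a consistent experience as the percentage is raised. The function name and the identifiers are hypothetical.

```python
import hashlib


def in_new_experience(customer_id: str, rollout_percent: float) -> bool:
    """Decide deterministically whether a customer sees the new site.

    rollout_percent is between 0 and 100. The same customer always lands
    in the same bucket, so raising the percentage only ever adds customers.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < rollout_percent * 100
```

Hashing rather than random sampling means no per-customer state needs to be stored to keep the assignment stable between visits.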
  • Let’s jump in and look at our stack. It’s pretty typical. We decoupled the front-end web application from the back-end web services. This allows us to test each component and layer independently, to ensure they operate within SLA. We also performance test the entire stack, simulating production loads and stressing and soaking the systems. Our goal in performance testing is really just to assert the variance of each build against a baseline; the most realistic environment is… production.
  • >> We decompose each request into the data fetches required to render the response. We dispatch those concurrently and insert Future results into the model. >> The controller hands off to the framework to render the view. >> The view layer begins rendering immediately, blocking on data only when required and flushing completed responses. By ensuring that only low-latency data requests are required to render the visual header, we can improve perceived page load time.
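The concurrency model described above can be sketched in a few lines. This is a minimal illustration using Python's `concurrent.futures`, not Betfair's actual framework; the fetch functions, the model keys and the markup are all hypothetical stand-ins.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def fetch_header() -> str:
    # Hypothetical low-latency call needed for the visual header.
    return "header"


def fetch_markets() -> str:
    # Hypothetical slower back-end call.
    time.sleep(0.05)
    return "markets"


def build_model(executor: ThreadPoolExecutor) -> dict[str, Future]:
    """Controller: dispatch every data fetch at once and store the Futures."""
    return {
        "header": executor.submit(fetch_header),
        "markets": executor.submit(fetch_markets),
    }


def render(model: dict[str, Future]):
    """View: yield chunks as soon as possible, blocking only when data is needed."""
    yield f"<header>{model['header'].result()}</header>"  # can be flushed early
    yield f"<main>{model['markets'].result()}</main>"     # blocks until ready


def handle_request() -> list[str]:
    with ThreadPoolExecutor(max_workers=4) as executor:
        model = build_model(executor)
        return list(render(model))
```

Because the slow fetch is already in flight while the header chunk is produced, the first chunk does not wait for the slowest back-end call.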
  • Performance optimizations. I don’t need to go into too much detail; you can read all this stuff for yourself. In a nutshell, it’s all about making fewer requests, for less stuff, then refining with browser optimizations and user experience optimizations. Reduce HTTP requests: JavaScript combo-loading, combined CSS, image sprites. Minify and compress all assets: JS, CSS. Split requests across domains: separate domains for JS, CSS and product images. Use a CDN hostname on a separate domain from the main page to avoid shipping cookies on asset requests; this is particularly bad for us, as our legacy apps on the same domain write a TON of cookies. Flush the buffer early to allow the browser to start rendering sooner. Defer loading of content that’s not needed for the initial page view.
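As a small illustration of "fewer requests, for less stuff", the sketch below combines several script sources into one payload and gzip-compresses it. It is an assumption-laden toy, not the build pipeline used on the site: the function name is hypothetical, and real bundlers also minify and handle statement boundaries between files.

```python
import gzip


def bundle_assets(sources: list[str]) -> bytes:
    """Combine several text assets into one gzip-compressed payload.

    One combined response replaces len(sources) separate requests.
    """
    combined = "\n".join(sources).encode("utf-8")
    return gzip.compress(combined)
```

The browser receives a single response and decompresses it transparently when it is served with a `Content-Encoding: gzip` header.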
  • You need an SLA. Get it agreed with your product stakeholders; we published ours in our customer commitment. It ensures performance engineering receives priority over other features, because you can point to it and measure against it. This is ours (actually, it’s more detailed than this; you can read it online). Write it down and put it somewhere accessible, whether that's on a wiki or on a wall. The devil is in the details. Bandwidth? Where are most of your customers? What’s average? For us, it seems 4 Mbps is common in the UK now. What are the most common browsers? IE8+, Firefox and Chrome for us.
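An SLA phrased as "3 seconds for 95% of customers" can be checked mechanically against measured load times. The sketch below uses the nearest-rank percentile definition, which is one of several in common use and is an assumption here; the function names and default thresholds are illustrative.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_sla(load_times_s: list[float], target_s: float = 3.0, pct: float = 95.0) -> bool:
    """True if the pct-th percentile load time is within the target."""
    return percentile(load_times_s, pct) <= target_s
```

With 20 samples, a single slow outlier still passes a 95th-percentile target, while two slow samples do not.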
  • So how do we measure? Get a WebPageTest private instance: it provides a consistent mechanism to test your applications, simulating bandwidth in real browsers, and enables a very short develop, test, measure cycle. In production, use synthetic clients: real browsers over last-mile, real connections from around the world. They provide a consistent measurement that can be used to identify trends and availability issues. Also take real user measurements: instrument your pages and beacon the results back to your analytics tool. Your analytics library might provide this already. We use the Navigation Timing spec where available, or fall back to cookies.
  • Logs offer rich seams of information. We use Splunk to aggregate and index our data. It allows us to access all logs from all hosts in a single place and join data together. Your access logs may already provide valuable information like server response times and payload sizes. We record performance information in our web logs and can mine the data to identify poorly performing requests. Each unique web request generates a correlation ID that we pass to all dependent services so that we can trace log entries across every tier. This is great for going back in time or correlating data between independent services.
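A correlation ID scheme of the kind described above might look like the following sketch. The header name `X-Correlation-Id`, the logger name and the helper functions are hypothetical; the talk does not say how the ID is generated or transported.

```python
import logging
import uuid

CORRELATION_HEADER = "X-Correlation-Id"  # hypothetical header name


def ensure_correlation_id(headers: dict[str, str]) -> str:
    """Reuse the caller's correlation ID, or mint one at the edge tier."""
    return headers.get(CORRELATION_HEADER) or uuid.uuid4().hex


def outbound_headers(correlation_id: str) -> dict[str, str]:
    """Headers to attach to every call made to a dependent service."""
    return {CORRELATION_HEADER: correlation_id}


def request_logger(correlation_id: str) -> logging.LoggerAdapter:
    """Logger that stamps every record with the request's correlation ID."""
    return logging.LoggerAdapter(
        logging.getLogger("web"), {"correlation_id": correlation_id}
    )
```

With a log format that includes `%(correlation_id)s`, a single search for one ID in the log aggregator returns the entries from every tier that handled the request.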
  • Obligatory graph slide: we’ve implemented OpenTSDB to capture time series data. There are lots of different ways of feeding it data; we can import log files, poll JMX statistics or receive events. We can combine many different types of metrics from our applications, operating systems and networking infrastructure: HTTP request rates and latencies across our entire estate; higher-level event rates like markets opened/closed and bets placed; cache sizes and hits or misses; queue sizes and produce or consume rates. We can then mix and match on a single graph, segmenting by clusters, hosts etc. It’s invaluable for spotting trends and correlations.
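For a sense of what a data point looks like, OpenTSDB accepts a plain-text `put` line of the form `put <metric> <timestamp> <value> <tagk=tagv ...>` with at least one tag. The sketch below only formats such a line; the metric and tag names in the usage note are made up for illustration and are not Betfair's.

```python
def opentsdb_put_line(metric: str, timestamp: int, value: float, tags: dict[str, str]) -> str:
    """Format one data point as an OpenTSDB text-protocol 'put' line."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    # Sort tags so the output is stable and easy to compare.
    tag_str = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}"
```

For example, `opentsdb_put_line("http.requests", 1320000000, 42, {"host": "web01", "cluster": "a"})` produces `put http.requests 1320000000 42 cluster=a host=web01`. Tags such as host and cluster are what make the per-cluster and per-host segmentation possible.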
  • More on performance logs. Consider a page that makes numerous server-side requests to assemble its data. >> Each of these calls is executed concurrently where there are no dependencies. For a sampling of requests we capture timing information and emit it as log events. >> In order to visualize it, we transform it into HAR format. We can then use any HAR viewer to render the graph, which allows us to identify which calls are being made, in what order, and what’s taking the most time. We’re working on a utility to automatically produce these graphs for slow-running requests as a way of identifying bottlenecks.
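The transformation into HAR can be sketched as below. The input shape (an `offset_ms`, `duration_ms` and `url` per call) is an assumed log format, the creator name is hypothetical, and the request method and response status are placeholders, since the timing events described above would not necessarily carry them. The whole duration is attributed to the `wait` phase.

```python
from datetime import datetime, timedelta


def to_har(request_start: datetime, calls: list[dict]) -> dict:
    """Convert server-side call timings into a minimal HAR 1.2 document."""
    entries = []
    for call in calls:
        started = request_start + timedelta(milliseconds=call["offset_ms"])
        entries.append({
            "startedDateTime": started.isoformat(),
            "time": call["duration_ms"],
            "request": {
                "method": "GET",  # placeholder: not known from the timing log
                "url": call["url"],
                "httpVersion": "HTTP/1.1",
                "cookies": [], "headers": [], "queryString": [],
                "headersSize": -1, "bodySize": -1,
            },
            "response": {
                "status": 200, "statusText": "OK",  # placeholders
                "httpVersion": "HTTP/1.1",
                "cookies": [], "headers": [],
                "content": {"size": 0, "mimeType": "application/json"},
                "redirectURL": "", "headersSize": -1, "bodySize": -1,
            },
            "cache": {},
            "timings": {"send": 0, "wait": call["duration_ms"], "receive": 0},
        })
    return {
        "log": {
            "version": "1.2",
            "creator": {"name": "perf-log-to-har", "version": "0.1"},  # hypothetical
            "entries": entries,
        }
    }
```

Serializing the result with `json.dumps` gives a file that a HAR viewer can render as a waterfall, with each bar offset by the call's start time.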
  • I mentioned earlier that we flush the buffer early. We do this to emit the complete visual header so the browser can start rendering sooner. Item 1 is the initial HTTP request/response. Why is TTFB (the green color) so long and the content download so short? Answer: it turns out a load balancer was buffering the response and only sending the data once it had 8Kb. We worked with our Citrix engineers to identify the correct behavior and issue the fix. It was literally a one-line change.
  • We actually only fixed this yesterday. Now, rather than first byte essentially being delayed by the entire server-side response time, it’s much shorter. On slower pages it has pulled in start-render time. It’s just something important to keep in mind: there’s lots of stuff between your users and your web servers which may be conspiring against you. You need to examine the whole stack.
  • Here’s a landing page comparison measured from IE8 in London over a DSL line, full page load time. Top = old landing page; bottom = new landing page. We went from 12 or so seconds to 3 to 3.5 seconds. So we’re not quite there yet.
  • Our customers took notice too. Remember that earlier feedback?
  • But what does this mean for our business? Sessions that bet: this is a key conversion rate for us, essentially those who purchased something from us. It is up 2 percentage points, which has a direct correlation to revenue. Bounce rate for our landing page improved by 40%. Perhaps that’s the appeal of a “beta” site. Maybe there’s a lesson there. Twice as many page views per visit. While we offer an opt-out link, only 1% have chosen to use it.
  • Performance improvements are tangible: you can measure them; customers can feel them; performance improves the bottom line. We’re not done, and will never be done. It’s a continual effort.
  • That’s the end of this chapter of our journey. Thank you for allowing me the time to share it.

    2. THE EXCHANGE
    3. OUR PRODUCTS
    4. INTERESTING STATS: 3.7M+ registered customers; 140 locations; 17 languages; £300M funds on deposit; £68M on the 2010 World Cup Final; 30,000 bets placed in one minute
    5. WHAT’S THE PROBLEM? We haven’t invested in website performance. “during the world cup the site has run like treacle” “the site is frequently slow and freezes too often” “very slow and glitchy” “I find Betfair slow”
    8. HIERARCHY OF WEBSITE NEEDS: It’s Cool; It’s Useful; It’s Fast; It Works
    9. OUR PLAN • Deliver incrementally • Build a complete slice of functionality • Broaden the offering. Timeline (April 2011 to January 2012): Private POC (April); First Delivery (August); Public Invitations to Customers (September); Diverted Customers (October); then More Sports, All languages, New Features
    10. TECHNOLOGY / ARCHITECTURE: Testable Layer; Testable Layer
    11. CONCURRENCY MODEL: dispatch service calls; controller’s job done; rendering starts early
    12. PERFORMANCE OPTIMIZATIONS • Reduce HTTP requests • Minify and compress assets • Split requests across domains • Use cookie-free CDN hostnames • Flush the buffer early • Defer loading of content • Etc…
    13. SLAS • 3 second full page load • For 95% of customers • Under peak loads • With no errors • Over typical bandwidth • From common browsers
    15. LOG MINING • Access logs: user agents, response times, payload sizes • Performance logs: timings of code paths and dependent service calls • Correlation IDs to trace requests across tiers
    16. GRAPHS
    18. FLUSHING THE BUFFER: Networking device thwarting flush(). Magic setting: httpcompressonpush=1
    19. RESULT
    21. CUSTOMER FEEDBACK: “Faster” “Seems faster and more user friendly” “page loaded quicker” “The pages seem to load a lot faster”
    22. RESULTS • Sessions that bet: up 2 percentage points • Bounce rate: improved 40% • Page views: 2x • 1% opt-out rate
    23. WHAT HAVE WE LEARNED? • Performance improvements are tangible • Measurably improves the bottom line • There’s lots more to do