4. IN NUMBERS
4.0m+ Funded Accounts
140 locations
30,000 bets placed in one minute
120,000+ requests per second
£288m funds on deposit
£2.2bn Mobile FY12
35. BAD STUFF HAPPENS! SO
PREPARE FOR FAILURE
EVERY LAYER MATTERS
INFRASTRUCTURE EVOLVES AT A
SLOWER RATE THAN CODE
YOU HAVE TO CARE
36. THANK YOU (REALLY THIS TIME!)
Martin Anderson @mdjanderson
Abraham Ingersoll @aberoham
http://betfair.jobs
Editor's Notes
Wow, we didn’t expect quite so many people for the graveyard shift, so thanks for coming! For those of you who have had your minds blown with topics like the “Mysteries of CDNs” and the “Google Compute Engine”, this is not one of those talks! As you can guess from the title, this is a fairly lighthearted look at some of our experiences, especially the unexpected surprises, of developing a brand new website for our company, and how sometimes the decisions that one of us makes are in direct conflict with what the other one ideally wants.
As you’ll have guessed from the slide branding, we both work for Betfair, one of the world’s largest betting companies.
We have a lot of products but the main one that we are known for is the betting exchange. Unlike a normal bookmaker, where you can only back an outcome (like “I want …”), you are able to lay it too. Laying is effectively just taking a back bet from another person. Size-wise, we do a fair bit of business.
This all comes from a volume of bets that exceeds the combined volume of all the stock exchanges in Europe. My favourite is that 20% of customers admitted that they have used their mobile to bet at a wedding. We are practically a bank: we deal with massive volumes of money, so people are very interested in our site staying up, being secure and being fast. The company has development centers in the UK, US, Portugal, Romania and Australia. We have a whole host of products, not just the exchange, and of course our products are subject to very strict rules from regulators. There is a massive amount of complexity.
M – To give you an idea of what we were looking to improve, here is an agonizing graph to show just how lightning fast our old site was. We’re measuring this via Keynote: Internet Explorer 8, from locations that represent 70% of our customer base, over last-mile DSL connections, measuring the download and execution time of every asset.
M – Betfair was in the process of building a new website. During the previous few years the company had grown massively and the old site had allowed us to scale with this demand. But there came a point when we wanted to present our users with a website that gave them a world-class experience, with great performance, operational monitoring, SEO, customer analytics, easy deliverability and the capacity for A/B testing baked in. The company brought in some guys who had done a similar job at Shopzilla and I joined their team. Knowing how important it was to deliver this quickly and keep delivering, we would have to do things differently and move forwards in a more DevOps style, with operations guys embedded within our teams. So we sat down and thought: “the one thing that is missing here is an angry American guy”. A –
What the new site is ---
No, it wasn’t. We worked incredibly hard to create this new website and make it outstanding, but the reality is that there are always going to be events that blindside you. We had a whole range of things go on that we never would have expected, across Performance, Operational Monitoring, Resilience and Release Process, so here are a few of them.
A – Leads. Explain the new website architecture: moving from client-side rendering to server-side. Twitter announced at the last Velocity conference that they are doing the same thing. JVM based.
M – The new website puts an emphasis on a load of cool things including performance, SEO and A/B testing out of the box. It’s very flexible since it takes a very modular approach, so a single page can have dozens of separate JavaScript and CSS files. Of course one of the first things we did was to bundle and optimise these assets using WRO4J, an awesome little open source library that transforms those files using a range of tools like Google Closure Compiler, LESS and others via the Rhino JavaScript engine. We initially created the bundles at compile time, but this meant builds got longer and longer as more modules were created. Also, the A/B tests mean that we have an enormous number of potential combinations of files, so we decided to do this at runtime, with the name of the final bundle being an aggregation of the processed files. Unfortunately this process is rather slow and generating a single complex bundle can take up to 6 seconds. But since we were using a CDN, we would be protected from the sheer volume of user requests. It also meant that we could squeeze the absolute most out of these optimization processes and not worry about how long they took. Right?
M – There was a bug in the naming strategy we used (files were ordered as loaded by the request, not alphabetically), which meant that rather than having a single canonical version, each server could have its own name for the same bundle, which exacerbated the issue.
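The naming bug described above can be illustrated with a minimal sketch. This is not WRO4J's API or the actual Betfair code: the function `bundle_name`, the file names and the `bundle-<hash>` format are all hypothetical. It assumes the set of files alone determines the bundle, and it derives the name from a sorted list so that request order cannot leak into it.

```python
import hashlib


def bundle_name(files, extension="js"):
    """Return a bundle name that depends only on the set of files, not their order.

    Hypothetical helper for illustration; not part of WRO4J.
    """
    # Sort and de-duplicate so every server derives the same canonical list,
    # whatever order the modules were loaded in for a given request.
    canonical = sorted(set(files))
    digest = hashlib.sha1("\n".join(canonical).encode("utf-8")).hexdigest()
    return f"bundle-{digest[:12]}.{extension}"


if __name__ == "__main__":
    # Two servers see the same modules requested in a different order.
    server_a = bundle_name(["nav.js", "betslip.js", "market.js"])
    server_b = bundle_name(["market.js", "nav.js", "betslip.js"])
    print(server_a == server_b)
```

With one canonical name per combination of files, the CDN can cache a single copy instead of asking each origin server to regenerate its own differently named version.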
M – Leads.
M – You saw in one of the earlier slides that our old site was hardly a speed machine, and one of our main reasons for moving to the new web platform was that we wanted better performance. The devs looked at a whole range of metrics, including our full page load times and time to first byte. When our web platform gets a request, it issues a load of requests to underlying services in parallel. It starts rendering HTML as soon as it has any data. We were so proud that the server-side time to first byte was about 50ms and the client-side full page load was about 3 seconds. But when we tried the site, it didn’t feel fast. It was obviously better than the old site but nowhere near as fast as it should have been.
A –
M – And this was the result of Abe changing a single character in the load balancer config.
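The "fan out in parallel, render as soon as any data arrives" behaviour mentioned in these notes can be sketched roughly as follows. This is an illustration only: the real platform is JVM based, and the service names, delays and HTML fragments here are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(service, delay):
    """Stand-in for a call to an underlying service (hypothetical)."""
    time.sleep(delay)
    return f"<div id='{service}'>{service} data</div>"


def render_page(services):
    """Yield HTML chunks as soon as each service responds."""
    # The page head goes out immediately, before any service has replied,
    # which is what keeps the server-side time to first byte low.
    yield "<html><body>"
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        futures = [pool.submit(fetch, name, delay) for name, delay in services.items()]
        # as_completed hands back futures in completion order, not submit order.
        for future in as_completed(futures):
            yield future.result()
    yield "</body></html>"


if __name__ == "__main__":
    for chunk in render_page({"markets": 0.05, "account": 0.01, "betslip": 0.03}):
        print(chunk)
```

A fast first byte from the application only helps if every layer in front of it passes the chunks on as they are produced.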
M – Leads.
M – Because the old site was stitched together on the client side, the underlying network architecture reflected this. It treated every request as hostile and routed them all through the same network infrastructure.
M – So what you have here is one massively powerful, high-IO website yelling at a huge set of very high-IO data services. And in the middle is a firewall. In fact, a single firewall device. We didn’t involve Networks enough.
M – Leads. How to turn a racing car into a Lada (use Andy’s own pictures).
Abe – Leads.
So the business came to us with a requirement that needed some persistence in the web tier. We’d been putting that off for some time, since a stateless web application is far easier to scale than a stateful one, but they wanted it and they wanted it right now. We kind of had three options:
1. Use our standard persistence technology, Oracle. Pros: bulletproof reliability (we use it for all our transactional data), well understood in the company, and we know how to make it scale. Cons: licensing costs aside, the delivery overhead would be huge. The impact on development and testing would slow us down enormously, and this was the absolute antithesis of what the business wanted.
2. Use something else that the company already supported and that would fit well into the delivery process, even if it was not a perfect fit.
3. There is of course a third option: go away and find the perfect tool. But we were embracing risk here! Deliver early even if it is not perfect. No one was going to wait around until we built or found this technology.
So we went for option 2, with the plan that it would work well enough for us to go away and evaluate the perfect solution (Memcached, Coherence, Twemcache, Mongo, Couchbase or other).
M – As we started to take over the full site, we realised that we needed some routing layer above all the applications.
Traffic Tsunamis
Event events eventually smash us
Log compressions at 4am
Jitter is your friend and Kelvin quote
Keynote quote? “An average of an average only works if the distribution is standard. Web performance is never standard”
----- Meeting Notes (01/10/2012 16:46) -----
Jitter in application
Accidentally queue network packets
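A small worked example shows one way an average of averages goes wrong. The numbers are invented for illustration: two servers with very different request counts and response times.

```python
from statistics import mean

# Hypothetical response times in milliseconds for two servers.
# Server A handled many fast requests; server B handled a few slow ones.
server_a = [100] * 90
server_b = [1000] * 10

# Averaging the per-server averages weights both servers equally,
# even though server A served nine times as many requests.
average_of_averages = mean([mean(server_a), mean(server_b)])

# The average over every individual request.
true_average = mean(server_a + server_b)

print(average_of_averages)  # 550
print(true_average)         # 190
```

Even the correct overall average of 190ms hides the fact that one request in ten took a full second, which is why skewed web performance data is usually read through percentiles rather than means.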
Have any of you guys heard of The Quarterback problem?
The Prius Effect
----- Meeting Notes (01/10/2012 16:46) -----
Added complexity of config outside of env
----- Meeting Notes (01/10/2012 17:03) -----
But did it work? If we hadn’t snuggled up, we never would have done this. We’ve tried this before. We’ve overcome every obstacle that’s come up, and that’s only because we’ve worked together.