Spy v Spy - Treachery in the Dev/Ops Trenches

bloodredsun

Talk from myself and Abe Ingersoll for Velocity Europe 2012

SPY V SPY - TREACHERY IN THE
DEV/OPS TRENCHES
Martin Anderson
Abraham Ingersoll

IN NUMBERS

4.0m+ Funded Accounts
140 locations
30,000 bets placed, one minute
120,000+ requests per second
£288m funds on deposit
£2.2bn Mobile FY12

WHO DO YOU WANT TO BUILD A BETTER WEBSITE

OPERATIONS MAGIC: AN ORDER OF MAGNITUDE
FASTER WITH JUST ONE BIT

MONITORING HIGH PERFORMANCE

Photo: itwasntandy

THUNDERING HERDS FROM ABOVE AND BELOW

TESTING IN PRODUCTION

CC image courtesy wikipedia

BAD STUFF HAPPENS! SO
PREPARE FOR FAILURE
EVERY LAYER MATTERS
INFRASTRUCTURE EVOLVES AT A
SLOWER RATE THAN CODE
YOU HAVE TO CARE

THANK YOU (REALLY THIS TIME!)
Martin Anderson @mdjanderson
Abraham Ingersoll @aberoham

http://betfair.jobs
1. SPY V SPY - TREACHERY IN THE DEV/OPS TRENCHES. Martin Anderson, Abraham Ingersoll

2. (no slide text)

3. WHAT WE ARE

4. IN NUMBERS: 4.0m+ Funded Accounts; 140 locations; 30,000 bets placed, one minute; 120,000+ requests per second; £288m funds on deposit; £2.2bn Mobile FY12

5. THE OLD SITE

6. WHO DO YOU WANT TO BUILD A BETTER WEBSITE

7. WE DID IT!

8. THANK YOU

9. HOLD ON! – WAS IT PLAIN SAILING?

10. PERFORMANCE

11. THE JAVAGATOR

12. THE JAVAGATOR

13. OPERATIONS MAGIC: AN ORDER OF MAGNITUDE FASTER WITH JUST ONE BIT

14. OPERATIONS MAGIC: AN ORDER OF MAGNITUDE FASTER WITH JUST ONE BIT

15. FIREWALLS AND FIRE-BREATHERS

16. FIREWALLS AND FIRE-BREATHERS

17. OPERATIONAL MONITORING

18. MONITORING HIGH PERFORMANCE (Photo: itwasntandy)

19. OVER-MONITORING HIGH PERFORMANCE

20. NOT SO HIGH PERFORMANCE

21. RESILIENCE

22. WEB TIER PERSISTENCE

23. INTRODUCING NOSQL

24. INTRODUCING NOSQL

25. INTRODUCING NOSQL

26. THUNDERING HERDS FROM ABOVE AND BELOW

27. THUNDERING HERDS FROM ABOVE

28. THUNDERING HERDS FROM ABOVE

29. THUNDERING HERDS FROM BELOW

30. DELIVERY PROCESS

31. TESTING IN PRODUCTION

32. TESTING IN PRODUCTION (CC image courtesy wikipedia)

33. TESTING IN PRODUCTION (CC image courtesy wikipedia)

34. SO WHAT DID WE LEARN?

35. BAD STUFF HAPPENS! SO PREPARE FOR FAILURE. EVERY LAYER MATTERS. INFRASTRUCTURE EVOLVES AT A SLOWER RATE THAN CODE. YOU HAVE TO CARE.

36. THANK YOU (REALLY THIS TIME!) Martin Anderson @mdjanderson, Abraham Ingersoll @aberoham, http://betfair.jobs

Editor's Notes

Wow, we didn't expect quite so many people for the graveyard shift, so thanks for coming! For those of you who have had your minds blown with topics like the "Mysteries of CDNs" and the "Google Compute Engine", this is not one of those talks! As you can guess from the title, this is a fairly lighthearted look at some of our experiences, especially the unexpected surprises, of developing a brand new website for our company, and how sometimes the decisions that one of us makes are in direct conflict with what the other one ideally wants.
As you'll have guessed from the slide branding, we both work for Betfair, who are one of the world's largest betting companies.
We have a lot of products but the main one that we are known for is the betting exchange. Unlike a normal bookmaker, where you can only back an outcome like I want , you are able to lay it too. Laying is just effectively taking a back bet from another person.
Size wise, we do a fair bit of business.
This all comes from a volume of bets that exceeds the combined volume of all the stock exchanges in Europe.
My favourite is that 20% of customers admitted that they have used their mobile to bet at a wedding.
We are practically a bank: we deal with massive volumes of money, so people are very interested in our site staying up, being secure and being fast.
The company has development centers in the UK, US, Portugal, Romania and Australia. We have a whole host of products, not just the exchange, and of course our products have very strict rules from regulators.
There is a massive amount of complexity.
M - To give you an idea of what we were looking to improve, here is an agonizing graph to show just how lightning fast our old site was.
We're measuring this via Keynote:
- Internet Explorer 8
- From locations that represent 70% of our customer base
- Over last-mile DSL connections
- Measuring the download and execution time of every asset
M - Betfair was in the process of building a new website. During the previous few years the company had grown massively and the old site allowed us to scale with this demand. But there came a point when we wanted to present our users with a website that gave them a world-class experience, with great performance, operational monitoring, SEO, customer analytics, easy deliverability and the capacity for A/B testing baked in. The company brought in some guys who had done a similar job at Shopzilla and I joined their team. Knowing how important it was to deliver this quickly and keep delivering, we would have to do things differently and move forwards in a more DevOps style by having operations guys embedded within our teams.
So we sat down and thought: "the one thing that is missing here is an angry American guy".
A -
What the new site is ---
No it wasn't. We worked incredibly hard to create this new website and make it outstanding, but the reality is that there are always going to be events that blindside you.
We had a whole range of things go on that we never would have expected, across Performance, Operational Monitoring, Resilience and Release Process, so here are a few of them.
A - Leads
Explain the new website architecture: moving from client-side rendering to server-side. Twitter announced at the last Velocity conference that they are doing the same thing. JVM based.
M - The new website puts an emphasis on a load of cool things including performance, SEO and A/B testing out of the box. It's very flexible since it takes a very modular approach, so we have a single page with dozens of separate JavaScript and CSS files.
Of course one of the first things we did was to bundle and optimise these assets using WRO4J, an awesome little open source library that transforms those files using a range of tools like Google Closure Compiler, Less and others via the Rhino JavaScript engine. We initially started using this with the bundles being created at compile time, but this meant builds started getting longer and longer as more modules were created. Also, the A/B tests mean that we have an enormous potential number of combinations of files, so we decided to do this at runtime, with the name of the final bundle being an aggregation of the processed files. Unfortunately this process is rather slow, and generating a single complex bundle can take up to 6 seconds. But since we were using a CDN, we would be protected from the sheer volume of user requests. It also meant that we could squeeze the absolute most out of these optimization processes and not worry about how long they took.
Right?
M - There was a bug in the naming strategy we used (files were ordered by how they were loaded in the request, not alphabetically), which meant that rather than having a single canonical version, each server could have its own name for the same bundle, which exacerbated the issue.
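The naming bug described in these notes is easy to reproduce in miniature. The following is a minimal sketch, not the WRO4J naming strategy itself: the function name `bundle_name`, the file names and the use of a SHA-1 digest are all assumptions made for illustration. It shows how sorting the file list before deriving the name gives every server the same canonical bundle name, whatever order the request listed the files in.

```python
import hashlib


def bundle_name(files, extension="js"):
    """Return a bundle name that depends only on the set of files,
    not on the order in which a request happened to list them."""
    # Sort and de-duplicate so every server derives the same name
    # for the same set of modules (hypothetical naming scheme).
    canonical = sorted(set(files))
    digest = hashlib.sha1("\n".join(canonical).encode("utf-8")).hexdigest()
    return f"bundle-{digest[:12]}.{extension}"


# Two servers see the same modules in a different request order.
server_a = bundle_name(["betslip.js", "header.js", "markets.js"])
server_b = bundle_name(["markets.js", "betslip.js", "header.js"])
print(server_a == server_b)
```

This only addresses the name. It assumes the bundle contents are also built in a canonical order, or that file order does not affect the result; if load order matters for correctness, the order would need to be part of the name instead.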
M - Leads
M - You saw in one of the earlier slides that our old site was hardly a speed machine, and one of our main reasons for moving to the new web platform was that we wanted better performance. The devs looked at a whole range of metrics, including our full page load times and time to first byte. When our web platform gets a request, it issues a load of requests to underlying services in parallel. It starts rendering HTML as soon as it has any data. We were so proud that the server-side time to first byte was about 50ms and the client-side full page load was about 3 seconds.
But when we tried the site, it didn't feel fast. It was obviously better than the old site but nowhere near as fast as it should have been.
A -
M - And this was the result of Abe changing a single character in the load balancer config.
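The rendering approach in these notes (fan out to the underlying services in parallel, then emit HTML as soon as any data arrives) can be sketched roughly as below. This is an illustration only, written with Python's asyncio rather than the JVM stack the site actually used; the service names, the delays and the helper `fetch_fragment` are hypothetical stand-ins.

```python
import asyncio


async def fetch_fragment(name, delay):
    """Stand-in for a call to an underlying service (hypothetical);
    the delay simulates that service's latency."""
    await asyncio.sleep(delay)
    return f"<div id='{name}'>{name} data</div>"


async def render_page(services):
    """Start all service calls at once and collect fragments in
    completion order rather than in request order."""
    tasks = [asyncio.create_task(fetch_fragment(n, d)) for n, d in services.items()]
    chunks = ["<html><body>"]  # the page head can be flushed immediately
    for finished in asyncio.as_completed(tasks):
        # A real server would write each fragment to the response here.
        chunks.append(await finished)
    chunks.append("</body></html>")
    return chunks


services = {"markets": 0.03, "account": 0.01, "betslip": 0.02}
page = asyncio.run(render_page(services))
print(len(page))
```

The point of the design is that the slowest service bounds the total time, not the sum of all services, and the first bytes leave the server before any service has answered.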
M - Leads
M - Because the old site was stitched together on the client side, the underlying network architecture reflected this. It treated every request as hostile and routed them all through the same network infrastructure.
M - So what you have here is one massively powerful, high-IO website yelling at a huge set of very high-IO data services. And in the middle is a firewall. In fact, a single firewall device.
We didn't involve Networks enough.
M - Leads
How to turn a racing car into a Lada – use Andy's own pictures
Abe - Leads
So the business came to us with a requirement that needed some persistence in the web tier. We'd been putting that off for some time, since a stateless web application is far easier to scale than a stateful one, but they wanted it and they wanted it right now.
We kind of had three options:
1. Use our standard persistence technology, Oracle. Pros: bulletproof reliability (we use it for all our transactional data), well understood in the company, we know how to make it scale. Cons: licensing costs aside, the delivery overhead would be huge. The impact in development and testing would slow us down enormously, and this was the absolute antithesis of what the business wanted.
2. Use something else that the company already supported and that would fit well into the delivery process, even if it was not a perfect fit.
3. Go away and find the perfect tool. But we were embracing risk here! Deliver early even if it was not perfect. No one was going to wait around until we built or found this technology.
So we went for option 2, with the plan that it would work well enough for us to go away and evaluate the perfect solution (memcached, Coherence, twemcache, Mongo, Couchbase or other).
M - As we started to take over the full site, we realised that we needed some routing layer above all the applications
Traffic Tsunamis
Event events eventually smash us
Log compressions at 4am
Jitter is your friend and Kelvin quote
Keynote quote? "An average of an average only works if the distribution is standard. Web performance is never standard"
----- Meeting Notes (01/10/2012 16:46) -----
Jitter in application
Accidentally queue network packets
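The "jitter is your friend" note can be illustrated with a small sketch. Everything here is assumed for the example: the function `jittered_delay`, the count of 200 hosts and the ten-minute window are hypothetical and not taken from the talk. The idea is that adding a random offset to a scheduled start time (a 4am log compression, say) spreads the load, instead of every host firing at the same instant and creating a herd.

```python
import random


def jittered_delay(base_seconds, spread_seconds, rng=random):
    """Return the base delay plus a random offset, so hosts sharing
    the same schedule do not all fire together."""
    return base_seconds + rng.uniform(0, spread_seconds)


# 200 hypothetical hosts that would otherwise all start at the same instant.
rng = random.Random(42)  # seeded only to make the sketch repeatable
starts = [jittered_delay(0, 600, rng) for _ in range(200)]

# Count how many hosts start in each one-minute bucket of the ten-minute window.
buckets = [0] * 10
for s in starts:
    buckets[min(int(s // 60), 9)] += 1
print(buckets)
```

Without the jitter all 200 hosts would land in the first bucket; with it, the work is spread across the whole window, at the cost of a job that finishes up to ten minutes later.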
Have any of you guys heard of The Quarterback problem?
The Prius Effect
----- Meeting Notes (01/10/2012 16:46) -----
Added complexity of config outside of env
----- Meeting Notes (01/10/2012 17:03) -----
But did it work?
If we hadn't snuggled up, we never would have done this. We've tried this before.
We've overcome every obstacle that's come up, and that's only because we've worked together.