So who are SRE? The SRE team in PPB provides skills and services to the whole company. If an issue or a new product comes along, we consult with dev teams and operations to ensure everything is running as well as it can.
We own the monitoring and alerting tools in PPB, so that's tools like Sensu, OpenTSDB, Grafana and of course Splunk Cloud.
We automate as much as we can in PPB, so we ensure all our tools can be deployed via pipelines.
We're a FTSE 100 company. PPB's brands are ones most people in the UK and Ireland are familiar with, but we are part of a bigger global entity, newly christened Flutter. Flutter is to PPB roughly what Alphabet is to Google. We have our main commercial centers in Dublin, Melbourne, London and New York. Our software development is in Porto, Cluj, London, Edinburgh and Melbourne. Operations run out of Malta, Dublin, Melbourne and New Jersey. We also have retail outlets in the UK, ROI and, newly, the USA.
A quick overview of our websites for anyone who is not familiar with them. Betfair.com is a gaming company with two distinct offerings: Exchange and Sportsbook. The Exchange allows you to bet on what will happen or what won't happen, so essentially you can be the bookmaker or the punter. Customers bet against each other on the outcomes of sporting events. They can get better prices than at the sportsbooks, 20% better on average, because of higher competition and lower margins. Bets get matched on the Exchange when a backer (will happen) and a layer (won't happen) agree on the same odds. Blue = backers and pink = layers. For a £100 bet at 2.2, the backer gets £120 plus their stake back if they win. The layer keeps the backer's stake if the layer wins, and otherwise has a liability of £120 for the bet.
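The £100 at 2.2 example can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only: the function names are made up for illustration, and it assumes decimal odds and ignores any commission or fees.

```python
from decimal import Decimal


def backer_profit(stake: Decimal, decimal_odds: Decimal) -> Decimal:
    """Profit paid to the backer if the selection wins (their stake is returned on top)."""
    return stake * (decimal_odds - 1)


def layer_liability(backer_stake: Decimal, decimal_odds: Decimal) -> Decimal:
    """Amount the layer has to pay out if the selection wins."""
    return backer_stake * (decimal_odds - 1)


stake = Decimal("100")
odds = Decimal("2.2")
print(backer_profit(stake, odds))    # backer's profit if the selection wins
print(layer_liability(stake, odds))  # layer's risk on the same matched bet
```

The two numbers are always equal for a matched bet: what the backer stands to win is exactly what the layer stands to lose.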
Paddy Power is the more traditional betting format most people would know: you're placing a bet on a sporting outcome directly against Paddy Power. We offer markets on hundreds of different sports, from horse racing and football all the way to tiddlywinks, and also have games, casino and bingo sites. We also offer some novelty markets; currently we have markets on the winners of Love Island, of which I'm sure we're all fans.
PP and BF also have great mobile apps for Android and Apple, so feel free to check them out.
PP and BF merged 3 years ago, becoming one of the world's biggest gambling companies. They were two very similar companies using lots of different tools to do the same job, so we knew there were lots of synergies to be had. Right across the board we reviewed the tools we were using in both companies and picked the best tool in each area, based on what we needed going forward as one entity. Regarding Splunk, PP already had a Splunk Enterprise solution while BF was using another provider.
Initially we only needed this solution for Dev teams and IT Ops, but other teams quickly got to see the benefits of having Splunk in the org.
So what made us choose Splunk? It's hosted in the cloud by Splunk, so we don't have to worry about the day-to-day management of the application; this has freed up our time to work on other things. There are lots of free training and support resources on the web. For any issues we have a dedicated Customer Success Manager, Gavin, who seems to be online 24/7. It's very easy to get data into Splunk. Splunk fits very easily into our automated pipelines; we have 1000s of VMs that we manage through pipelines. It integrates well with email, Slack and PagerDuty if you're setting up alerts in Splunk. User management integrates easily with our corporate AD, so all users are managed from AD, not in Splunk.
Over the past 24 months you can see ingest has been growing steadily as more teams onboard into Splunk within PPB. We have gone from 1TB daily ingest 24 months ago to an average now of about 7TB a day. The nature of sport means we get very busy at weekends and during big sporting events, so we regularly see 10TB ingest on a busy Saturday, and our record was 13.3TB on the Aintree Grand National this year. Because our ingest fluctuates, we needed to be sure Splunk works as effectively and as fast when ingesting 13TB as when ingesting 7TB, and I'm happy to say it does.
The architecture behind our websites is extremely complex. We run everything as microservices that all interact with many other microservices. This slide is just a simple example of our cashout app, FCQ; you can see it interacts with 7 other microservices, which in turn interact with many others. If an issue occurs in an application, the root cause could in fact be in another service. Splunk is excellent at helping us pick through this complexity and identify issues very quickly.
The architecture of our Splunk Cloud can be seen here in a very simplified diagram. We have multiple datacenters sending logs to Splunk. Each DC has a layer of intermediate heavy forwarders that everything below proxies through, then onwards to Splunk Cloud. The heavy forwarders are very powerful: for example, here we can apply config settings related to log formats, parse out unwanted data and throttle our bandwidth if ever needed.
What we haven't shown here is that we are also ingesting logs from applications we are running in AWS, GCP and Azure directly into Splunk Cloud, using Splunk apps installed on the search head.
On the dark web it's very easy for hackers to get their hands on millions of user accounts and lists of passwords. They use these to try to access accounts on multiple sites, including our own. It's a daily occurrence. Protecting customers' accounts is very important for us. Our Fraud team is one of the most active users of the Splunk search interface in PPB. They use it to produce reports that list accounts with a high number of failed logins; as there's no latency, the data is the most recent, which is important. The reports are then uploaded to another tool that applies a set of rules, checks and filters to produce a list of accounts that need further investigation. Splunk is then used to check if any accounts were hacked; if they were, they're suspended. Then they check if any fraud took place. If it did, funds are blocked from moving and the customer balance is restored. There are some very clever frauds that take place on accounts. Most I can't talk about, but there is one I can: a customer says that their account was hacked and that the hacker placed bets from their account and lost all their money. They look to get the losses refunded. Fraud investigate on Splunk and determine whether the account was actually hacked or whether they're just chancing their arm.
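The idea behind the failed-login report can be sketched in a few lines of Python. This is only an illustration of the counting step, not the Fraud team's actual search or rules: the event fields (`account`, `outcome`) and the threshold value are hypothetical.

```python
from collections import Counter


def accounts_over_threshold(events, threshold):
    """Return accounts whose failed-login count is at or above the threshold."""
    failures = Counter(
        event["account"] for event in events if event["outcome"] == "failure"
    )
    return sorted(account for account, count in failures.items() if count >= threshold)


# Hypothetical login events, standing in for what a report would be built from.
sample_events = [
    {"account": "alice", "outcome": "failure"},
    {"account": "alice", "outcome": "failure"},
    {"account": "alice", "outcome": "failure"},
    {"account": "bob", "outcome": "failure"},
    {"account": "bob", "outcome": "success"},
]

print(accounts_over_threshold(sample_events, threshold=3))
```

The resulting list is what would then go on to the second tool for the rules, checks and filters.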
It's not possible to show you some of the dashboards they use, as they contain sensitive data, but this dashboard shows the countries with high numbers of failed logins. These are normally countries where gambling is banned or restricted.
Customer Services were introduced to Splunk over 6 months ago. Managers and team leaders use the dashboards to instantly identify probable issues on site that may be driving contact, for example an issue with a payments provider.
They want to react and get Tactical Messages out to stem contact levels as quickly as possible. This reduces call queues and waiting times, and it's a better experience for customers.
Before Splunk they would have had to depend on other teams like ProdOps to confirm issues; this took time and normally led to a build-up in call queues.
They're also using Splunk to speed up the turnaround of common customer queries, things like cashout. Cashout is a function available to customers that allows ……. They would investigate the reason for a cashout failing. Using inputs on dashboards allows them to easily search for details and quickly turn around queries. Here are some examples of the dashboards they use.
This use case is a good example of how we combine data from different sources to create a customized view.
First off, what do Capacity Management do? They manage our inventory of hypervisors. These HVs are what our virtual machines are built on, and all the PPB applications run on these. It's our private cloud. They plan capacity requirements for future events, such as Cheltenham and the Grand National, and for any new products. They work with teams to ensure they're using the right amount of resources and that their virtual machines have the correct specs.
The problem was that all the information they needed was spread across 3 different applications. To sort this we created nightly scripts that make API calls to gather info from ServiceNow, OpenTSDB and OpenStack. This data is then forwarded in JSON format to Splunk.
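The shape of such a nightly job can be sketched as below. This is a hedged illustration, not the real scripts: the API calls themselves are left out, the three inputs are hypothetical stand-ins for what OpenStack (inventory), OpenTSDB (usage) and ServiceNow (ownership) would return, and every field name is an assumption.

```python
import json


def build_capacity_events(inventory, usage, owners):
    """Join per-VM inventory with usage and ownership data into flat records."""
    events = []
    for vm in inventory:
        events.append({
            "vm": vm["name"],
            "hypervisor": vm["hypervisor"],
            "tla": vm["tla"],
            "vcpus": vm["vcpus"],
            "cpu_used_pct": usage.get(vm["name"]),  # None if no metric was found
            "owner": owners.get(vm["tla"]),         # None if no owner is recorded
        })
    return events


def to_json_lines(events):
    """Serialise one JSON object per line, ready to be forwarded as log events."""
    return "\n".join(json.dumps(event, sort_keys=True) for event in events)


# Hypothetical sample data in place of the three API responses.
inventory = [
    {"name": "vm-001", "hypervisor": "hv-a", "tla": "fcq", "vcpus": 4},
    {"name": "vm-002", "hypervisor": "hv-b", "tla": "fcq", "vcpus": 4},
]
usage = {"vm-001": 37.5}
owners = {"fcq": "cashout-team"}

print(to_json_lines(build_capacity_events(inventory, usage, owners)))
```

Joining the sources before sending keeps each event self-contained, so a dashboard panel can filter by hypervisor, TLA or owner without a further lookup.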
In Splunk they were able to create a number of customized dashboards combining all this data. Here are some examples.
In the first snippet you can see that we've combined the data from OpenStack for the inventory information and OpenTSDB for the actual resources used, such as CPU for last Saturday, what we used for the Grand National and for the last 20 days.
By selecting the inventory tenant you can drill down to expand the detail and get more info on what's on each HV and the resources they're using.
Here we've combined the info from OpenStack and ServiceNow to give us details of a TLA, the resources it has and the owners of that TLA.
This dashboard shows the distribution of VMs on each of the hypervisors. This shows whether you have enough resilience if an HV fails.
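One simple way to read "enough resilience" is to check that no TLA has all of its VMs on a single hypervisor. The sketch below shows that check under stated assumptions: the record fields are hypothetical, and the real dashboard may well use a different definition.

```python
from collections import defaultdict


def tlas_on_single_hypervisor(vms):
    """Return TLAs whose VMs all sit on one hypervisor, so one HV failure takes them out."""
    hypervisors_by_tla = defaultdict(set)
    for vm in vms:
        hypervisors_by_tla[vm["tla"]].add(vm["hypervisor"])
    return sorted(tla for tla, hvs in hypervisors_by_tla.items() if len(hvs) < 2)


# Hypothetical VM placement records.
placement = [
    {"name": "vm-001", "tla": "fcq", "hypervisor": "hv-a"},
    {"name": "vm-002", "tla": "fcq", "hypervisor": "hv-b"},
    {"name": "vm-003", "tla": "abc", "hypervisor": "hv-a"},
    {"name": "vm-004", "tla": "abc", "hypervisor": "hv-a"},
]

print(tlas_on_single_hypervisor(placement))
```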
Our two busiest periods of the year are Cheltenham and the Grand National (Spring Racing). They are to us what Black Friday is to online retail.
Splunk has ingested over 13TB of data during busy periods, but this needs to coincide with zero latency. Zero latency is a priority for us, so that we can monitor applications and react ASAP.
One example where Splunk proved itself was during this year's Grand National. All was going fine until 5 mins after the race had finished. The next race was due to start around 45 mins later. An issue occurred with one of our online mobile apps, a P1 was raised and teams engaged for investigation.
After around 10 mins the cause of the issue was found using Splunk. It took ten mins to apply the fix to approximately fifty hosts. Splunk was also used to confirm that the fix was applied to all the hosts and that recovery was taking place. It's important to know that any change you make has been applied correctly and that any errors occurring have stopped.
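Confirming a fix across a fleet boils down to comparing the hosts that should have it with the hosts whose logs show it. A minimal sketch of that comparison follows; the host names are invented, and how the "confirmed" list is obtained from the logs is assumed rather than shown.

```python
def hosts_missing_fix(expected_hosts, hosts_confirmed):
    """Return hosts that were expected to receive the fix but have not confirmed it."""
    return sorted(set(expected_hosts) - set(hosts_confirmed))


# Hypothetical fleet of fifty hosts, with one that has not yet logged the fix.
expected = [f"app-{n:02d}" for n in range(1, 51)]
confirmed = [host for host in expected if host != "app-17"]

print(hosts_missing_fix(expected, confirmed))
```

An empty result means every host has reported the change; anything left in the list is where to look next.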
We were out of action for around 20 mins, but back up and running in plenty of time to take bets on the next race.
If we weren't using Splunk, finding the root cause would have been a needle-in-a-haystack situation and we would definitely have missed the next race.
This would have meant a loss of revenue, customers and reputation. Customers can easily move to a competitor and it's hard to get them back.
SplunkLive! London 2019: Paddy Power Betfair
WHAT DO WE DO WITH THE 13TB OF DAILY INGEST?
Paddy Power Betfair: Who can handle our data?
David Ashe Senior SRE
Gerry Healy SRE
SplunkLive! London June 2019
PPB SRE team
11 years in Banking
Over 9 years in PPB
SRE based in Dublin, London and Porto
Monitoring and Alerting
UK and Ireland UK&I, Europe, ROW Australia
Sportsbook Sportsbook and Daily-
Wagering (Tote) and
Channel Online and Retail Online Online Online and Retail Online
…plus a growing B2B portfolio…
Paddy Power Betfair: Part of the Flutter group
Situation before Splunk
Paddy Power and Betfair merged in 2015
After the merger there were a lot of synergies to be made. Single tools chosen across the board
Manage large number of sources, hosts (1000s) and users
Scale well, Loads of Data, (7-15TBs) of daily
Initially required for Dev and ITOps to monitor and get stats
Why Splunk Cloud?
Managed by Splunk in the cloud, scales very easily
Loads of free training and support resources on the web
Splunk support, CRM/CSM (Gavin Nash) provided. Escalate anything to them.
Easy to onboard data
Easy to Automate in our pipeline deployments – we have
over 10000 devices so automating as much as possible is
Integrates well with other alerting tools (email, Slack and PagerDuty) when alerting on issues.
Single sign on with windows makes user management
Ingestion increase over 24 months
1TB to 13TB without compromising effectiveness of the tool
PPB consists of 100s of microservices
Splunk Architecture and metrics
7 TB Average Daily
1m+ Daily Searches
Protect Customer accounts
One of the most active users of Splunk in PPB
Identify accounts that have had a high number of failed login attempts
Suspend accounts, contact customers and ask them to use a strong password
Attacks from countries where gambling is restricted or banned totally
Fraud - Quickly identify risk accounts
logins per country
Last 60 mins
Last 60 minutes
Customer Services – Reacting quicker
Be aware of issues before increase in contacts
Get Tactical Messages out to stem contact levels
Shorter queue and a better service for
Used to investigate common issues, quicker
Looking to expand to deal with other common
Capacity Management – REST interfaces
Know what your inventory is and plan for future requirements
Understand VM distribution and
Ingests data produced by nightly jobs that make API calls to OpenStack and ServiceNow
Joins the data to build
Capacity Management – Custom built to help manage our private cloud
Capacity Management – Drill down to find TLA (microservice) owners
Capacity Management – Distribution of VMs on
Grand National busiest day of the year
Ingesting 13TB of Data
Critical to have zero latency
Potential loss of revenue, customers and reputation
Confirm fully recovered
Value of Splunk – Zero latency during busy days
Using correct sourcetypes =
Dashboards should only have enough panels to fill your screen.
Save panels as saved search
Splunk Answers is a great
Tune Splunk – work with Splunk to ensure you are sending data in the most efficient way
Promote Splunk's capabilities to more commercial teams in PPB
With the help of our CSM -
organize roadshows in our