In 2016 Paddy Power and Betfair, two gambling giants, merged to form PPB. Each company had its own monitoring baggage, but the SRE team was tasked with cleaning up and consolidating their toolsets. This Sensu Summit 2019 talk from Artur Malinowski and Killian McHale looks at their selection process, scoring and ultimately the decisions which led them to Sensu – which now monitors over 10,000 clients across the PPB estate.
9. 9PPBs Sensu Journey
Revenue Mix (chart; segment labels lost in extraction): 51%, 21%, 10%, 18%
Market: UK and Ireland. Product: Sportsbook and Gaming. Channel: Online and Retail.
Market: UK&I, Europe, ROW. Product: Sportsbook, Exchange and Gaming. Channel: Online.
Market: Australia. Product: Sportsbook. Channel: Online.
Market: USA. Product: Sportsbook and Daily-Fantasy-Sports. Channel: Online and Retail.
Market: USA. Product: Advanced Deposit Wagering (Tote) and Television broadcast. Channel: Online.
…plus a growing B2B portfolio…
Market: Georgia, Armenia. Product: Sportsbook and Gaming. Channel: Online.
15. 15PPBs Sensu Journey
Requirements
• Metric Collection
• Documentation
• User Interface
• Metric Graphing
• Updates/Regularity of Updates
• Features
• Performance
• Stability
• Time & Effort
• Scaling
• DR
• Interoperability / API
• API Completeness
16. 16PPBs Sensu Journey
Test Environment
• Scope of Environment
• Hypervisors
• VMs
• Network devices
• Storage
• Subset of applications
• Design Environment to test each solution in a consistent manner
21. 21PPBs Sensu Journey
Wait? What!?
• Are these guys at the wrong conference!?
• Purely based on our scoring Zenoss won
• Sensu came third!?
• Why are we here?
22. Results
Score (bar chart, axis 0 to 250):
• Nagios OMD: 181
• Sensu: 175
• Bosun: 140
• Prometheus: 168
• Zenoss: 198
26. 26PPBs Sensu Journey
Current Implementation
Sensu Self-Service:
- Why Self-Service ?
- Design
- Plans
Sensu management:
- Detecting Silenced Checks
- Detecting machines without client
- Detecting client versions
27. 27PPBs Sensu Journey
• A Sensu client is running on each machine
• The Sensu client knows what to do via information from SUBSCRIPTIONS
Sensu's design
28. 28PPBs Sensu Journey
• Minimize wait times
• Owners know their hosts best
• Satisfy customers
• Fewer resources to manage
Why Self-Service? No, we are not lazy - or at least this is not the only reason!!!
29. 29PPBs Sensu Journey
• We keep all our subscriptions in our GitLab repo
• All subscriptions are automatically deployed to the correct Sensu instances after uploading
• Changes are expected to be reflected in Sensu within a few minutes
So how is it self-service?
31. 31PPBs Sensu Journey
• The next step will be creating a fully automatic pipeline which will check merge requests and, if approved, the change will be automatically merged
• Do you want to make a change at 3 AM because of <reason/-s>? Sure, why not :)
Plans
Next ?
33. 33PPBs Sensu Journey
• The Sensu Audit connects to the Sensu API to retrieve information on all Sensu alerting.
• It tracks silenced Sensu alerts and invalid Sensu configurations.
• It inserts data into the Splunk index every day.
Sensu Audit
34. 34PPBs Sensu Journey
Missing TLAs (TLAs without Sensu)
• Dashboard to identify missing basic checks (CPU/load, Mem, disk).
• This is grouped by various ratings - good, bad and critical. Categorised by Business rated apps (Tier 1-3).
• Clickable links allow users to drill down into more detail, open the Sensu UI and visit the configuration location.
35. 35PPBs Sensu Journey
Shows counts, silenced/non-silenced, events by criticality. Events by contact table displaying top callouts by team.
Event Analysis
36. 36PPBs Sensu Journey
• This dashboard gives information on all silenced checks by TLA and number of individual checks.
• Information can be filtered by Business Criticality, Service, TLA, trend, support name and groups.
• Useful breakdown based on risk counts by name and team.
Finding silenced checks
37. 37PPBs Sensu Journey
Sensu Client Versions
• This dashboard gives information about the Sensu client version, based on TLA or hosts
38. 38PPBs Sensu Journey
• Easy to use
• Very readable json format
• Easy to join with other
information
• Good and well-maintained documentation
Conclusion – Sensu API is Powerful !!!
39. 39PPBs Sensu Journey
• Sensu Enterprise - End of support March 31, 2020
• Investigation
Future? Sensu Go?
Who the heck are PPB?!
PPB is the result of the merger between Paddy Power and Betfair in 2016
Merger of equals
Two major online gambling companies coming together
A quick look at the two brands....
Purely Online
Traditional Sportsbook and Exchange. The Exchange is a platform to facilitate betting between two parties.
You as a user can bet for or against a given outcome at your desired odds and we will match that bet with someone on the other side.
And we take a small commission on the winnings.
Paddy Power is a traditional Sportsbook.
You are betting against Paddy Power at the odds we offer.
We offer markets on hundreds of different events - mostly sporting.
Interesting bet
Paddy Power had a market for the 2016 US presidential election. In fact, analysing all the data coming in, we were able to predict the result before most and paid out on that one two weeks early…
Hrrmmm...
We all know the actual outcome of this one.
£5m... that's > $6m
2:30
PP doesn't take itself seriously. Always getting in trouble.
Cheltenham 2010 – Hollywood sign in a nearby field overlooking race track
Denmark’s Nicklas Bendtner during Euro 2012 - EUR100,000 fine that PP paid
PP pants hot air balloon – tethered in local garden for Cheltenham 2013
Arguably our most controversial ad, and topical again today. PP leaked it to various media outlets and let the keyboard warriors do their thing.
Raise awareness about real issues in the host country. Printed a retraction by getting the loggers back out.
4:30
As of earlier this year, we're part of Flutter Entertainment
Today we have multiple brands around the world.
Strongest in Europe but growing presence in other areas
FanDuel in the US. We actually have a retail shop in New Jersey
Growing B2B portfolio.
The Before
Two stacks collide
5:00
In the 2016 merger
- two very different stacks came together
- differing monitoring tools
- Nagios
- Opsview
- A little Sensu
- Various other tools in metric and log analysis space
The SRE team was tasked with consolidating our toolset and picking the best tools to support both stacks now and into the future
So to start we mapped out our approach….
The Approach
- What’s the problem
- What were the requirements
- How can we create a framework around them
- What do test environments look like
- What do tests look like
Short list
- Look at available tooling
- Compare at a high-level
- Narrow down to short list
Every good project has a plan.
Perfectly laid out and timed plan
We met all these dates:
Requirements
- Metric Collection / Graphing
- User Interface
- Documentation - Updates
- Feature Set
- Performance and Stability
- Time & Effort
- Interoperability
Series of questions. Weighted.
Design Environment to test each solution in a consistent manner
Scope of the environment
- Hypervisors
- VMs
- Network Devices
- Storage
- Subset of apps
Design to test each solution consistently
7:15
Short list.
As per plan we put these through their paces
Each solution was tested against our requirements
Perfect or max score was ~240
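The notes describe the scoring as a weighted series of questions with a maximum of roughly 240. A minimal sketch of that kind of scorecard follows. The category names come from the Requirements slide, but the weights, the 0 to 5 answer scale and the example answers are invented purely for illustration, since the talk does not publish the real ones.

```python
# Hypothetical weights per requirement; the real weighting is not given in the talk.
WEIGHTS = {
    "Metric Collection": 3,
    "Documentation": 2,
    "Interoperability / API": 4,
}
MAX_ANSWER = 5  # assumed answer scale of 0 to 5 per question


def weighted_score(answers):
    """Sum each answer multiplied by the weight of its requirement."""
    return sum(WEIGHTS[name] * value for name, value in answers.items())


def max_score():
    """Best possible total: every question answered with the top mark."""
    return MAX_ANSWER * sum(WEIGHTS.values())


# Made-up answers for one candidate tool.
example_answers = {
    "Metric Collection": 4,
    "Documentation": 5,
    "Interoperability / API": 2,
}
print(weighted_score(example_answers), "of", max_score())
```

Weighting lets a requirement such as the API count for more than a nice-to-have, though as the rest of the talk shows, the total alone did not settle the decision.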
So, having tested all the solutions
Without further ado
No surprises
Winner is
I know what you’re thinking…
- Are they at the wrong conference?
- Based on scoring Zenoss won
- Sensu came third
- Why are we here
Let's take a look at the scores…
8:30
Zenoss – right hand side – highest score of 198
Sensu – second from left – third at 175
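As a quick check of the ordering, the five bar values from the Results slide can be ranked directly. This assumes the values map to the tools in the left-to-right order of the chart labels, which agrees with the two scores called out in the notes (Zenoss 198, Sensu 175).

```python
# Scores read from the Results chart, in label order (assumed mapping).
scores = {
    "Nagios OMD": 181,
    "Sensu": 175,
    "Bosun": 140,
    "Prometheus": 168,
    "Zenoss": 198,
}

# Highest score first.
ranking = sorted(scores, key=scores.get, reverse=True)

for place, tool in enumerate(ranking, start=1):
    print(place, tool, scores[tool])
```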
This didn’t feel right…
At this point we eliminated Bosun
Things like API, Updates, Migration Path were quite important to us
Zenoss – great product full of features
- API – Not well documented. Wasn’t clear how we could integrate with it.
- Complexity – Self–service model and vast estate. Added complexity.
Nagios OMD
API – Not up to scratch for our requirements
Updates – Not updated in over a year
There can be two
Sensu scored well across all categories and had good coverage across the things that were most important to us.
When we combined it with Prometheus we found we had something that matched Zenoss feature set, but gave us the flexibility that we needed
Also, the migration path from Opsview and Nagios to Sensu was really nice, with them all running Nagios-compatible checks
Talk about our implementation
Don't focus on how Sensu works; focus on the cool things we are doing with Sensu.
My part is divided into 2 sections – Self-Service and Sensu Management (how we are using Sensu logs)
Quick very simple description of how it works.
Each of our hosts should have the Sensu client installed.
The Sensu client is a monitoring agent, installed on a system to be monitored.
The Sensu client runs on the machine and learns what it should do from its subscriptions.
Because it is so simple, it is very easy to understand, and as everyone knows simplicity is always key.
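The subscription idea can be sketched in a few lines. This is not Sensu's implementation, only an illustration of the matching rule: a check names its `subscribers`, a client lists its `subscriptions`, and the client runs every check where the two overlap. The check names, commands and client below are made up for the example.

```python
def checks_for_client(client, checks):
    """Return the names of checks whose subscribers overlap the client's subscriptions."""
    subscribed = set(client["subscriptions"])
    return sorted(
        name
        for name, check in checks.items()
        if subscribed & set(check["subscribers"])
    )


# Illustrative definitions only; not taken from the PPB repository.
checks = {
    "check_cpu": {"command": "check-cpu.rb", "subscribers": ["base"], "interval": 60},
    "check_nginx": {"command": "check-process.rb -p nginx", "subscribers": ["web"], "interval": 30},
}
client = {"name": "web-01", "subscriptions": ["base", "web"]}

print(checks_for_client(client, checks))
```

Adding a host to a subscription is therefore enough to give it a whole set of checks, which is what makes the model practical for owners to manage themselves.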
More than 10,000 machines, so it is impossible for us to implement checks for each of them.
We have a lot of different applications, so we can't know in detail what exactly needs to be checked.
Save time, a lot of time.
Happy customers mean happy us
Everything is kept on GitLab, as it is a great place to share code with other engineers.
Basic checks are made by a Git hook.
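The talk does not say what the Git hook verifies. As a rough sketch of the kind of basic check it could run, the function below parses a subscription file and reports obvious problems. It assumes the file follows the Sensu 1.x style of a top-level `checks` object, and the required keys are an assumption of this example, not PPB's actual rule set.

```python
import json

# Assumed minimum set of keys for a check definition in this sketch.
REQUIRED_KEYS = ("command", "interval")


def validate_subscription(text):
    """Return a list of problems found in a subscription JSON document."""
    try:
        document = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    checks = document.get("checks") if isinstance(document, dict) else None
    if not isinstance(checks, dict) or not checks:
        return ["no 'checks' object found"]

    problems = []
    for name, check in checks.items():
        if not isinstance(check, dict):
            problems.append(f"{name}: definition must be an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in check:
                problems.append(f"{name}: missing '{key}'")
    return problems
```

A hook built this way would reject the push or merge request when the returned list is not empty, and print the problems so the owner can fix them without involving SRE.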
The GitLab repo is easy to control (nobody is allowed to make bad changes which can break something), easy to request access to (it is easy to give access to users who want to add new checks) and easy to share (we can simply send URL links with details to the repo).
Git is a very common version-control system, so almost everyone knows how to use it, and if not, it is very easy to explain how.
There is also the GitLab GUI, which users can use to review their checks. Of course there is the Sensu GUI, which can be used to check clients, but in GitLab we can see the exact subscription JSON file.
Subscriptions are divided into zones (dev, prd, ie1, ie2 and sre), so we always deploy new subscriptions to the correct Sensu instance.
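Routing by zone can be sketched as grouping changed files by their zone before deployment. The zone names are the ones listed above. The repository layout, with one top-level directory per zone, is an assumption made for this example and the talk does not describe it.

```python
from pathlib import PurePosixPath

ZONES = {"dev", "prd", "ie1", "ie2", "sre"}


def zone_for(path):
    """Return the zone of a changed file, assuming a '<zone>/<file>.json' layout."""
    parts = PurePosixPath(path).parts
    zone = parts[0].lower() if parts else ""
    if zone not in ZONES:
        raise ValueError(f"{path}: not inside a known zone directory")
    return zone


def group_by_zone(changed_files):
    """Group changed subscription files by the zone they must be deployed to."""
    grouped = {}
    for path in changed_files:
        grouped.setdefault(zone_for(path), []).append(path)
    return grouped


print(group_by_zone(["dev/app.json", "prd/app.json", "dev/db.json"]))
```

Rejecting files outside a known zone keeps a misplaced subscription from being silently ignored or sent to the wrong instance.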
So take the situation in which you are a new user who wants to add a new check.
Mitigator is a reverse proxy: it directs client requests to the appropriate backend server.
1. You create a fork of the repo.
2. Make your changes.
3. Create a merge request (end of the user's work).
4. Next, GitLab webhooks post the new changes to the correct Sensu instances. (Explain what Mitigator is.)
5. Mitigator posts to all Sensu API and Sensu servers.
6. A flag is placed on the filesystem to signal that they need a git checkout.
7. A specially created cron job checks whether this flag exists.
8. If the flag exists, it simply runs git checkout.
9. It checks whether the changes are valid.
10. If yes, it simply restarts sensu-enterprise. If not, an alert is triggered for SRE that something is wrong and the changes are not implemented. In this situation we have until 3:30 AM to fix it. (Why 3:30? Because all sensu-enterprise instances restart at 3:30.)
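The server-side half of this flow (flag, cron job, checkout, validation, restart or alert) can be sketched as one function that a cron job calls on each run. The function name, its return values and the idea of passing the individual steps in as callables are assumptions of this sketch. The real job presumably shells out to git and the service manager directly.

```python
import os


def deploy_if_flagged(flag_path, update, validate, restart, alert):
    """Run one cron cycle and report what happened.

    update, validate, restart and alert are callables supplied by the caller,
    standing in for the git checkout, the config validation, the
    sensu-enterprise restart and the SRE alert described in the talk.
    """
    if not os.path.exists(flag_path):
        return "skipped"  # nothing new was posted since the last run

    update()              # bring the working copy up to date
    os.remove(flag_path)  # clear the flag so the next run does not repeat the work

    if validate():
        restart()
        return "deployed"

    alert("subscription changes failed validation and were not applied")
    return "rejected"
```

Keeping the steps injectable makes the decision logic easy to test without touching git or restarting a service.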
More self-service
Jenkins pipeline, with a special GitLab plugin which allows you to automatically merge an MR if the pipeline succeeds.
Why we need even more self-service: we use a human gate as the last step before merging to be sure that changes are correct, but because it sometimes took us a lot of time and interrupted engineers, we are planning to automate it. We will check the most crucial parts of the changes. Also, if our team is really busy, customers can wait even a few hours before anyone finds time to merge, which can result in worse customer satisfaction.
Instead of a human gate at the end, there will be a pipeline.
Very easy workflow: the user creates a new MR, GitLab sends a notification about the MR to Jenkins, Jenkins runs the pipeline with all checks, and if the results are fine (great day), the result is sent to GitLab and the rest is done as I explained in previous slides.
At the beginning we will keep the human gate, and Jenkins will only tell us whether an MR can be merged, but the last decision will be ours. After a few weeks, if the test results satisfy us, we will make it fully automated without any human gate at the end.
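The two-phase rollout described here, advisory first and fully automatic later, comes down to a small decision rule. The sketch below is only an illustration: the function name, the result labels and the check names are hypothetical, and the real logic would live in the Jenkins pipeline.

```python
def gate_decision(check_results, advisory=True):
    """Decide what the pipeline does with a merge request.

    check_results maps a check name to True (passed) or False (failed).
    In advisory mode a passing MR is only flagged for a human to merge.
    """
    if not all(check_results.values()):
        return "reject"
    return "notify-reviewer" if advisory else "auto-merge"


# Hypothetical pipeline checks.
results = {"json_valid": True, "required_fields": True}
print(gate_decision(results))                  # advisory phase
print(gate_decision(results, advisory=False))  # fully automated phase
```

Switching phases then means changing one flag rather than rewriting the pipeline.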
As we have hundreds of applications and thousands of hosts, we need some control over how Sensu is working on them, and that is why we started using the Sensu API and Splunk to do it.
We are using a specially created LRP (light reporting platform) in which we have code that collects all data through the Sensu API, processes it and takes the information that we are interested in.
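The internals of the LRP are not shown in the talk. As a rough sketch of the collect-and-forward idea, the code below reads a Sensu 1.x API endpoint over HTTP and wraps each record in the JSON envelope used by Splunk's HTTP Event Collector. Using the HEC as the route into Splunk, the index and sourcetype names, and the base URL in the usage note are all assumptions.

```python
import json
import urllib.request


def fetch(base_url, endpoint):
    """GET a Sensu API endpoint (for example 'silenced' or 'clients') and decode the JSON."""
    with urllib.request.urlopen(f"{base_url}/{endpoint}", timeout=10) as response:
        return json.load(response)


def to_hec_events(records, index, sourcetype, now):
    """Wrap each record in the envelope used by Splunk's HTTP Event Collector."""
    return [
        {"time": now, "index": index, "sourcetype": sourcetype, "event": record}
        for record in records
    ]


def hec_body(events):
    """Serialize several events into one request body, one JSON object per line."""
    return "\n".join(json.dumps(event, sort_keys=True) for event in events)
```

A daily job could then call something like `hec_body(to_hec_events(fetch("http://sensu-api.example:4567", "silenced"), "sensu_audit", "sensu:silenced", time.time()))` and post the result to the collector. The host name there is a placeholder.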
The dashboard is created from Sensu logs and information about our hosts
Explain what a TLA is
The dashboard identifies whether any of our TLAs is missing the most important checks
It helps us to know who we need to contact about missing checks and how important it is, by business-rated apps. It lists the TLAs which are affected by missing the most crucial checks.
TLA – Three-letter acronym (service name)
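The core of the missing-checks report is a set difference per TLA. A minimal sketch, assuming the audit data has already been reduced to a mapping from TLA to the check types it has configured. The TLA names and the check labels below are invented for the example.

```python
# Basic checks every TLA is expected to have (from the slide: CPU/load, Mem, disk).
BASIC_CHECKS = {"cpu", "memory", "disk"}


def missing_basic_checks(checks_by_tla):
    """Map each TLA to the basic checks it lacks; fully covered TLAs are omitted."""
    report = {}
    for tla, checks in checks_by_tla.items():
        missing = BASIC_CHECKS - set(checks)
        if missing:
            report[tla] = sorted(missing)
    return report


# Hypothetical input: TLA -> check types currently configured.
configured = {
    "abc": ["cpu", "memory", "disk", "http"],
    "xyz": ["cpu"],
    "qrs": [],
}
print(missing_basic_checks(configured))
```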
As we have thousands of events, we needed something to visualize their status: whether we are in a critical situation with a lot of bad events, or everything is fine. We can also use it to troubleshoot incidents, as we can see which contact was alerted about bad events.
This dashboard helps us to analyze events: how many of them are critical, which hosts are affected and who was contacted about specific events.
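The counts behind such a dashboard can be sketched with two counters. The sketch assumes Sensu 1.x style event records, where the check result carries a numeric `status` (0 OK, 1 warning, 2 critical, anything else unknown). Reading the alerted teams from a `contacts` list on the check is an assumption about how contact routing is configured.

```python
from collections import Counter

SEVERITY = {0: "ok", 1: "warning", 2: "critical"}


def summarize(events):
    """Count events by severity and by contact."""
    by_severity = Counter()
    by_contact = Counter()
    for event in events:
        check = event["check"]
        by_severity[SEVERITY.get(check["status"], "unknown")] += 1
        for contact in check.get("contacts", []):
            by_contact[contact] += 1
    return by_severity, by_contact


# Made-up events for illustration.
sample_events = [
    {"client": {"name": "web-01"}, "check": {"name": "check_cpu", "status": 2, "contacts": ["team-a"]}},
    {"client": {"name": "web-02"}, "check": {"name": "check_disk", "status": 1, "contacts": ["team-a", "team-b"]}},
    {"client": {"name": "db-01"}, "check": {"name": "check_mem", "status": 3}},
]
severity, contacts = summarize(sample_events)
print(severity.most_common(), contacts.most_common())
```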
Again, when we have hundreds of users it is very hard to stop all of them silencing checks forever, as someone can always forget about it or have some other reason. That's why we have this dashboard.
Don’t get me wrong, the silence option is great, but people should not leave checks silenced forever. That's why we need this dashboard to see how many checks are silenced and, again, who needs to be contacted.
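Finding the "forever" silences can be sketched as a filter over the entries returned by the Sensu silencing API. The sketch assumes each entry has an `id`, an `expire` value that is negative when no expiry was set, and an `expire_on_resolve` flag. Those field semantics are my reading of the Sensu 1.x API and should be checked against the version in use.

```python
def never_expiring(silenced_entries):
    """Return ids of silence entries with no expiry and no expire-on-resolve."""
    return [
        entry["id"]
        for entry in silenced_entries
        if entry.get("expire", -1) < 0 and not entry.get("expire_on_resolve", False)
    ]


# Made-up entries for illustration; ids follow the 'subscription:check' form.
entries = [
    {"id": "web:check_cpu", "expire": -1, "expire_on_resolve": False, "creator": "alice"},
    {"id": "db:*", "expire": 3530, "expire_on_resolve": False, "creator": "bob"},
    {"id": "app:check_disk", "expire": -1, "expire_on_resolve": True, "creator": "carol"},
]
print(never_expiring(entries))
```

Joining the result with the `creator` field gives the list of people to contact.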
To control the versions of Sensu clients we have another dashboard. When we know how many TLAs are up to date, we know where we stand and how many of them are old.
It is especially useful when we update our package to a new client version and want to know how many hosts have been updated and who should be convinced to redeploy their TLA to get the new version of Sensu.
You can't see it here, but when you click on a section of this chart you can see the list of hosts.
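The version chart and its drill-down can be sketched as a count per version plus a list of hosts that are not on the target version. This assumes client records with a `name` and a `version` field, as a Sensu 1.x clients listing provides. The version numbers below are placeholders.

```python
from collections import Counter


def version_report(clients, target):
    """Count clients per version and list the hosts not on the target version."""
    counts = Counter(client.get("version", "unknown") for client in clients)
    outdated = sorted(
        client["name"] for client in clients if client.get("version") != target
    )
    return counts, outdated


# Made-up client records for illustration.
clients = [
    {"name": "web-01", "version": "1.8.0"},
    {"name": "web-02", "version": "1.6.2"},
    {"name": "db-01", "version": "1.8.0"},
    {"name": "old-01"},
]
counts, outdated = version_report(clients, target="1.8.0")
print(counts.most_common(), outdated)
```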
The Sensu API is easy and powerful. You can gather information about everything you need without checking the GUI or configuration.