Here is some interesting material on what Vodafone is doing with Splunk.
This one in particular was presented at .conf2013, Splunk's worldwide convention. .conf2014 will take place in October this year: plan ahead and attend, it is worth every penny!
As a reminder, registration for .conf2014 is already open, at a promotional price.
More information here: http://conf.splunk.com/?r=homepage
2. Information Manager & Data Analyst
• Norbert holds a Master’s Degree in Mechanical Engineering from
the RWTH Aachen/Germany.
• He has been working in the IT industry for nearly 20 years now,
initially in Marketing and Technical Writing.
• With Vodafone for the last 4 years, he is involved in Reporting and
ultimately built up the Splunk infrastructure in his department,
analyzing data from 1000+ virtual servers/machines and 90
databases with 20 Splunk instances.
• His focus in Splunk is the creation of sophisticated dashboards
which present valuable information in an easy-to-use manner.
Various audience groups including 24/7 Monitoring, Operations &
Support as well as Management are using these dashboards to
gain personal insight into complex technical processes.
3. Agenda
• Vodafone Group Services Emerging Technologies Deployment &
Support – What We Do
• Why We are Using Splunk
• Splunk Infrastructure
• Splunk Dashboards – A Journey Over 2 Years
• Splunk Enterprise 6 – Our Next Steps
• Splunk Enterprise 6 Live Demo – What We Have Tested
• Summary
5. Use Case: Carrier Billing
• Today we are using our smartphones more and more for purchasing
different kinds of digital goods.
• One common use case is to buy apps, music or subscriptions to
various services, where the purchased good is used directly on the
smartphone.
• Usually the mobile user expects to find these purchases on his monthly
bill from his mobile network carrier. Alternatively, if the user has a
prepaid contract, the purchase should be taken from his balance.
• The team presenting its Splunk use case at the .conf is called
Deployment & Support, which is part of Vodafone’s Emerging
Technologies department.
• The team is responsible for operating a complex platform for carrier
billing in 20 Vodafone companies and partner markets. For the sake of
simplicity we will call this the “charging platform”.
[Figure: Charging Platform]
7. There is a Bit More Behind the Scenes ...
• To complete the cycle of purchasing some questions have to be answered:
– Who is the mobile customer?
– Is the customer using a mobile device or a PC?
– Is this a prepaid or postpaid customer?
– To which local market is the customer registered?
– Which business partner is providing the purchased digital goods?
– Which business partner takes a share of the revenue?
– Which additional systems need to be informed about the purchase?
• Ultimately the overall cycle results in several requests and responses
between the mobile customer’s device and multiple server systems.
• Currently the Deployment & Support team operates 570 virtual servers/
machines and approx. 40 databases for this purpose in production and
test environments.
8. Several Entities Are Involved or Affected Somehow
[Figure: the Charging Platform and the entities involved or affected: Customer Care, Pricing, Billing, 24x7 Monitoring, Development, Deployment, Management, Business Partner]
10. Why We Are Using Splunk
• Before Splunk: Cacti, RRDtool, tail, log files ...
• Have all data from different IT systems available in one single
environment and in (nearly) real-time.
• Correlation of data from different sources.
• Easy-to-use interface for any non-technical audience in multiple
user groups.
16. Requests and Response Times
• This is where we started from:
Showing requests and response times from a group of
servers which perform the same tasks
[Figure: two charts, Average Response Times and Amount of Requests]
17. Simple KPI Dashboard – Set Color Based on Values
• Our KPI dashboards show the performance of
different services.
• Since we have SLA targets of nearly 100%, it’s
sometimes hard to see whether the target was met or
breached.
• We decided to use a different color for the columns if the
target was breached for a certain time range.
• Since we could not directly assign a color to a certain
value, we generated 2 result series: one holds the values
that met the target and is shown in green, the other holds
the breached values and is shown in red.
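The two-series approach can be sketched outside Splunk as follows. This is a minimal illustration in Python, not the search used in the dashboard; the function name `split_by_target` and the 99.9 target are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def split_by_target(
    values: List[float], target: float
) -> Tuple[List[Optional[float]], List[Optional[float]]]:
    """Split one KPI series into a 'met' series and a 'breached' series.

    Both outputs keep the length of the input. A slot is None where the
    value belongs to the other series, so a chart can give each series
    its own color (for example green and red).
    """
    met = [v if v >= target else None for v in values]
    breached = [v if v < target else None for v in values]
    return met, breached


# Hypothetical availability percentages per time bucket.
kpi = [99.95, 99.99, 99.2, 100.0]
met, breached = split_by_target(kpi, target=99.9)
print(met)       # [99.95, 99.99, None, 100.0]
print(breached)  # [None, None, 99.2, None]
```

Every time bucket has a value in exactly one of the two series, so stacking or overlaying them reproduces the original columns with two colors.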
18. Set Colors by Value for Line Charts
• Some processes of post-processing data are supposed
to be finished within a time range of 2 hours
(120 minutes).
• This chart shows the actual processing time for a
certain part of the process.
• Processing times below the limit are in green, above
in red.
• In case the processing takes more than 10 hours, the
chart will cut the line in black.
• Similar to the KPI dashboard, this is realized using
multiple result series which are layered on top of
each other.
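A rough sketch of the layering idea, assuming three series (green, red, black) and clipping at the 10-hour mark. The names and the clipping behaviour are assumptions for illustration; a real line chart would also need shared points where the color changes so that the segments connect.

```python
from typing import Dict, List, Optional

LIMIT_MIN = 120  # target from the slide: 2 hours
CAP_MIN = 600    # cut-off from the slide: 10 hours


def layer_processing_times(minutes: List[float]) -> Dict[str, List[Optional[float]]]:
    """Distribute processing times over three layered series."""
    layers: Dict[str, List[Optional[float]]] = {"green": [], "red": [], "black": []}
    for m in minutes:
        if m > CAP_MIN:
            band, shown = "black", float(CAP_MIN)  # clip so the axis stays readable
        elif m > LIMIT_MIN:
            band, shown = "red", m
        else:
            band, shown = "green", m
        # Only the matching layer gets the value; the others get a gap.
        for name, series in layers.items():
            series.append(shown if name == band else None)
    return layers


layers = layer_processing_times([90, 150, 720])
print(layers["green"])  # [90, None, None]
print(layers["red"])    # [None, 150, None]
print(layers["black"])  # [None, None, 600.0]
```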
19. Correlation of Amount of Requests and Response Times
• The next step: combine amount of requests and
response times into a single chart.
• Amounts are rendered as a column chart, response times
as line charts.
• Each chart uses its own y-axis scaling.
• Additional gauge chart for real time view.
20. Showing Maintenance
• If you have a monitoring team watching your
dashboards, it might be helpful to inform them about
maintenance periods.
• Should unusual charts show up during maintenance,
the monitoring team can immediately see that this might
be related to planned maintenance, and act accordingly.
• The maintenance graph can be realized with one single
event stating the start and end time, e.g. from one
database record.
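How a single start/end event can become a chartable series is sketched below. This is an assumption-laden Python illustration, not the dashboard's search: `maintenance_series` and its parameters are hypothetical names, and the bucket size is arbitrary.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def maintenance_series(
    start: datetime,
    end: datetime,
    range_start: datetime,
    range_end: datetime,
    step: timedelta,
) -> List[Tuple[datetime, int]]:
    """Return (bucket_start, flag) pairs; flag is 1 while maintenance is active."""
    series = []
    t = range_start
    while t < range_end:
        # A bucket is flagged if it overlaps the maintenance window.
        active = t < end and t + step > start
        series.append((t, 1 if active else 0))
        t += step
    return series


# One hypothetical maintenance record: 20:00 to 21:00.
series = maintenance_series(
    start=datetime(2013, 3, 10, 20, 0),
    end=datetime(2013, 3, 10, 21, 0),
    range_start=datetime(2013, 3, 10, 19, 0),
    range_end=datetime(2013, 3, 10, 23, 0),
    step=timedelta(minutes=30),
)
flags = [flag for _, flag in series]
print(flags)  # [0, 0, 1, 1, 0, 0, 0, 0]
```

The resulting 0/1 series can be drawn as a background area behind the regular charts.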
21. Combine Summary Indexing with Drilldown to Live Search
• Sometimes you might find gaps in charts built on
summary indexes. The charts load fast,
but are incomplete.
• In this case we can drill down to another version of
the same chart, which is using the live index instead
of summary.
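The slide describes a manual drilldown. As a hedged illustration of the underlying idea, the sketch below detects missing time buckets in summarized data and suggests which source to use. All names are hypothetical, and automatic switching is an assumption of this example, not something the presentation claims.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_gaps(
    summary: Dict[datetime, float],
    start: datetime,
    end: datetime,
    step: timedelta,
) -> List[datetime]:
    """Return the bucket start times in [start, end) without a summary value."""
    missing = []
    t = start
    while t < end:
        if t not in summary:
            missing.append(t)
        t += step
    return missing


step = timedelta(minutes=5)
start = datetime(2013, 3, 10, 20, 0)
# Hypothetical summary data: the 20:05 bucket was never written.
summary = {start: 42.0, start + 2 * step: 40.0}

gaps = find_gaps(summary, start, start + 3 * step, step)
source = "live" if gaps else "summary"
print(gaps)    # [datetime.datetime(2013, 3, 10, 20, 5)]
print(source)  # live
```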
[Figure: two versions of the chart, labeled Summary index and Live index]
22. Regression Test with Different Software Versions
• Run regression tests with different versions of a
software or different settings.
• The dashboard will automatically find all different runs
and add them to the drop-down lists, regardless of when the
test run was executed.
• Easily compare selected runs over a certain time frame
which might show significant values.
24. Compare Results with Previous Weeks
• This chart shows the amount of purchase transactions
in the charging platform for a certain selection of
local market, business partner or other
characteristic attributes.
• The amount of transactions is displayed as a column
chart, for example split by product.
• The overlay shows comparative values for the same
time range from 3 weeks before.
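The overlay boils down to looking up, for each time bucket, the value from the same bucket 3 weeks earlier. A minimal Python sketch under that assumption follows; `with_comparison` and the sample counts are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def with_comparison(
    counts: Dict[datetime, int],
    range_start: datetime,
    range_end: datetime,
    step: timedelta,
    weeks_back: int = 3,
) -> List[Tuple[datetime, int, int]]:
    """Return (bucket, current_count, count_weeks_back) for each bucket."""
    offset = timedelta(weeks=weeks_back)
    rows = []
    t = range_start
    while t < range_end:
        # Missing buckets count as 0 transactions in this sketch.
        rows.append((t, counts.get(t, 0), counts.get(t - offset, 0)))
        t += step
    return rows


# Hypothetical daily transaction counts.
counts = {
    datetime(2013, 2, 17): 100,  # exactly 3 weeks before 10 March
    datetime(2013, 3, 10): 120,
}
rows = with_comparison(
    counts,
    range_start=datetime(2013, 3, 10),
    range_end=datetime(2013, 3, 11),
    step=timedelta(days=1),
)
print(rows)  # [(datetime.datetime(2013, 3, 10, 0, 0), 120, 100)]
```

Shifting by whole weeks keeps weekdays aligned, which matters when traffic follows a weekly pattern.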
25. Compare Results with Previous Weeks
• In this situation something is different from
previous weeks.
• We can see a significant increase of transactions, and
all of those are coming from the items marked in yellow.
• The 3rd layer in the background is rendered as an area
chart and shows transactions with errors only.
26. Revenue Loss Calculation
• As mentioned in the introduction, the charging platform
supports business cases to generate revenue from
purchasing processes.
• On the other hand, this means that an outage in the
charging platform may impact the revenue and
potentially lead to revenue loss.
• The revenue loss calculator is a tool that supports
people involved in technical issues to quickly detect the
potential financial impact of an outage and take the
appropriate actions.
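The slides do not say how the calculator works. One plausible approach, shown here purely as an assumption, is to compare the revenue expected during the outage (for example derived from previous weeks) with the revenue actually booked. The function name and all figures are hypothetical.

```python
def estimate_revenue_loss(
    expected_per_minute: float,
    actual_revenue: float,
    outage_minutes: float,
) -> float:
    """Estimate the revenue lost during an outage.

    Assumption of this sketch: loss = expected revenue for the outage
    window minus the revenue that was still booked, never below zero.
    """
    expected = expected_per_minute * outage_minutes
    return max(0.0, expected - actual_revenue)


# Hypothetical figures: 50 currency units per minute expected,
# 500 actually booked during a 30-minute outage.
loss = estimate_revenue_loss(
    expected_per_minute=50.0, actual_revenue=500.0, outage_minutes=30
)
print(loss)  # 1000.0
```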
28. Ticket History
• Using Splunk to create reports about ticketing systems
enables us to make the related information available to
a wide audience.
• Here each “team” can identify within seconds how many
tickets are open and their status. They can view a list
of details as well.
• The history shows the trend of tickets in “open” status.
29. Scatter Chart with Transparency
• Scatter charts can help to identify significant patterns in
the relation between 2 attributes.
• In this case we measure the resolution time of software
defects within a development team over time.
• Before May, we encountered “bug closing parties”
on certain days, resulting in high peak values.
• After changing the processes, the team is now
working continuously on defects, resulting in lower
resolution times.
30. Writing Data to Splunk to Sort Out Incorrect Tickets
• We also use Splunk to report SLA-related information.
• Sometimes there might be tickets which are assigned to
our team by mistake.
• Since we don’t want to have those tickets in our SLA
report, we created a dashboard where users can
identify the tickets and add comments to them.
• The comments are then written to a lookup table, which
is used as a filter for showing only relevant tickets in
SLA reports.
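The lookup-as-filter idea can be sketched in Python as follows. The CSV layout, the column names `ticket_id` and `comment`, and the function names are assumptions for illustration; the actual lookup table used in the dashboard is not shown in the slides.

```python
import csv
import io
from typing import Dict, List, Set


def load_excluded_ids(lookup_csv: str) -> Set[str]:
    """Read the ticket IDs that users have commented on as incorrectly assigned."""
    reader = csv.DictReader(io.StringIO(lookup_csv))
    return {row["ticket_id"] for row in reader}


def relevant_tickets(tickets: List[Dict[str, str]], excluded: Set[str]) -> List[Dict[str, str]]:
    """Keep only tickets that are not listed in the lookup table."""
    return [t for t in tickets if t["ticket_id"] not in excluded]


# Hypothetical lookup content written by the dashboard.
lookup = "ticket_id,comment\nT-2,assigned to the wrong team\n"
tickets = [{"ticket_id": "T-1"}, {"ticket_id": "T-2"}, {"ticket_id": "T-3"}]

sla_tickets = relevant_tickets(tickets, load_excluded_ids(lookup))
print([t["ticket_id"] for t in sla_tickets])  # ['T-1', 'T-3']
```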
32. What Will We Get from Splunk?
• Overlay charts without Flash.
• Interactive dashboards with forms in simple XML.
• Data models to provide customised information access.
35. Sophisticated Dashboards with Splunk
• Using Splunk you can create sophisticated dashboards which meet the
requirements of various audience groups, including monitoring teams,
real techies, as well as upper management.
• Splunk dashboards can cover technical processes as well as business
cases and organizational processes.
• Splunk Enterprise 6 will bring most of the functionality required for
sophisticated dashboarding to a level at which it can be used by a wider
range of users.
36. Next Steps
1. Download the .conf2013 Mobile App. If not iPhone, iPad or Android, use the Web App.
2. Take the survey & WIN A PASS FOR .CONF2014… Or one of these bags!
3. View all “What’s New” presentations. PPTs on the .conf2013 Mobile App. Recordings will be available shortly.
39. The Ultimate Log File Format (ULFF)
• After the first steps with Splunk, we found that several
different log file formats may lead to errors or at least
confusion, e.g. if timestamps or return codes do not
share the same format.
• The resulting issues can be solved within Splunk,
but this requires additional configuration and
processing power.
• Finally we decided to define a new ultimate log file
format (ULFF) which is being implemented in all
applications feeding the Charging Platform – ULFF is a
custom JSON format.
• {"transaction-id":"1234-5678-9012",
"usecase-id":"9876-5432-1098",
"timestamp":"2013-03-10T20:24:25,123+01:00",
"country-code":"GB",
"status":"ok",
"error":"",
"payload":"<xml attr=\"value\"></xml>"}
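To show why one shared format simplifies processing, here is a minimal Python sketch that parses a ULFF-style record. The function name `parse_ulff` is hypothetical, and the sample mirrors the slide's example with the quotes inside the payload escaped so that it is valid JSON.

```python
import json
from datetime import datetime

# Sample record modeled on the slide; the raw string keeps the \" escapes.
RAW = r'''{"transaction-id":"1234-5678-9012",
"usecase-id":"9876-5432-1098",
"timestamp":"2013-03-10T20:24:25,123+01:00",
"country-code":"GB",
"status":"ok",
"error":"",
"payload":"<xml attr=\"value\"></xml>"}'''


def parse_ulff(line: str) -> dict:
    """Parse one ULFF record and convert its timestamp to a datetime."""
    record = json.loads(line)
    # ISO 8601 allows a comma as the decimal mark; normalize it to a dot
    # so that datetime.fromisoformat accepts it on older Python versions too.
    record["timestamp"] = datetime.fromisoformat(
        record["timestamp"].replace(",", ".")
    )
    return record


record = parse_ulff(RAW)
print(record["country-code"])           # GB
print(record["timestamp"].microsecond)  # 123000
print(record["payload"])                # <xml attr="value"></xml>
```

With every application emitting the same keys and timestamp format, a single parser like this covers all sources.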
40. Some Examples of Complexity: Prepaid or Postpaid
• In case a customer has a postpaid contract, the charging platform sends
information to the customer’s local Vodafone market, and they will put the
purchase on the customer’s bill.
• In case the customer is a prepaid customer, the charging platform performs
several additional steps:
1. Check if the customer’s balance is sufficient for the desired purchase.
2. Somehow “block” the amount required to purchase the item.
3. Provide the desired item.
4. Finally take the amount from the “blocked” balance.
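The four prepaid steps resemble a reserve-then-capture pattern. The Python sketch below illustrates that pattern only; the class, the method names, and the release-on-failure behaviour are assumptions and do not describe the charging platform's real implementation.

```python
from typing import Callable


class InsufficientBalance(Exception):
    """Raised when the unblocked balance does not cover the price."""


class PrepaidAccount:
    def __init__(self, balance: int) -> None:
        self.balance = balance  # in minor currency units, e.g. pence
        self.blocked = 0

    def available(self) -> int:
        return self.balance - self.blocked

    def block(self, amount: int) -> None:
        # Steps 1 and 2: check the balance, then reserve the amount.
        if amount > self.available():
            raise InsufficientBalance(f"need {amount}, have {self.available()}")
        self.blocked += amount

    def release(self, amount: int) -> None:
        self.blocked -= amount

    def capture(self, amount: int) -> None:
        # Step 4: take the amount from the blocked balance.
        self.blocked -= amount
        self.balance -= amount


def purchase(account: PrepaidAccount, price: int, deliver: Callable[[], None]) -> None:
    account.block(price)
    try:
        deliver()  # Step 3: provide the desired item.
    except Exception:
        account.release(price)  # assumption: failed delivery frees the reservation
        raise
    account.capture(price)


account = PrepaidAccount(balance=500)
purchase(account, price=199, deliver=lambda: None)
print(account.balance, account.blocked)  # 301 0
```

Blocking before delivery means the customer cannot spend the same balance twice while the item is being provided.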
41. Some Examples of Complexity: Cellular or Wireless LAN Connection
• If the customer has a connection to his mobile carrier’s network when starting a
purchase cycle, the authentication can be established using the MSISDN
from the customer’s SIM card.
• If the customer is located in an area without mobile network coverage and has
WLAN connectivity only, there is no MSISDN communicated for the charging
platform to refer to.
In this case the user may install an app on his mobile device which is able to
provide the MSISDN via wireless LAN connectivity as an alternative.
• If the customer is not using a mobile device at all, but a PC instead, there is
no way to provide the MSISDN automatically.
In this case the user may manually enter the MSISDN which is to be charged
for the purchase. For authentication, the charging platform could send an
SMS to this MSISDN providing a one-time authorization code for
the purchase.
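The three cases above amount to a small decision function. The following Python sketch is illustrative only: the names are hypothetical, and the fallback for a mobile device on WLAN without the app is an assumption, since the slide does not cover that case.

```python
from enum import Enum


class AuthMethod(Enum):
    NETWORK_MSISDN = "MSISDN from the SIM card via the carrier network"
    APP_MSISDN = "MSISDN provided by an app over wireless LAN"
    SMS_ONE_TIME_CODE = "manually entered MSISDN confirmed by an SMS code"


def choose_auth_method(
    is_mobile_device: bool,
    on_carrier_network: bool,
    has_msisdn_app: bool,
) -> AuthMethod:
    """Pick an authentication method for a purchase cycle."""
    if is_mobile_device and on_carrier_network:
        return AuthMethod.NETWORK_MSISDN
    if is_mobile_device and has_msisdn_app:
        return AuthMethod.APP_MSISDN
    # PC users, and (as an assumption of this sketch) mobile users on
    # WLAN without the app, enter the MSISDN manually.
    return AuthMethod.SMS_ONE_TIME_CODE


print(choose_auth_method(True, True, False).name)    # NETWORK_MSISDN
print(choose_auth_method(True, False, True).name)    # APP_MSISDN
print(choose_auth_method(False, False, False).name)  # SMS_ONE_TIME_CODE
```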
42. Showing Long Term Trends
• The overlay technique is a good tool for operation
engineers or monitoring teams who need to observe the
current situation of server systems.
• But those charts can also be used to gain better insight
into long-term processes.
43. Database Performance
• Before using Splunk, we took this information
from RRDtool.
• Now we monitor physical I/O values (reads, writes,
redos), waits, sessions and CPU or disk usage directly
in Splunk.
44. Correlation of Amount of Requests and Response Times
• The next step: combine the amount of requests and
response times into a single chart.
• Amounts are rendered as a column chart, response times
as line charts.
• Each chart uses its own y-axis scaling.
45. Monitoring Message Queues
• Real-time gauge charts to monitor the amount of
messages processed in Java messaging queues.
• In parallel, get a more detailed view over the last 60
minutes comparing the amount of requests with
processing time.
46. Spanning Based on Selected Time Range
• In time charts the span is either set to a fixed value or
calculated automatically.
• A fixed span results in too many data points for long time
ranges; auto mode results in different maximum values.
• We have implemented time charts where the span is
adjusted to the selected time range and values are automatically
converted to comparable Transactions Per Minute (TPM).
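A sketch of the two pieces involved: choosing a span from the selected time range, and normalizing each bucket's count to TPM so that charts with different spans stay comparable. The thresholds and function names are assumptions for illustration, not the values used by the presenter.

```python
from datetime import timedelta


def choose_span(time_range: timedelta) -> timedelta:
    """Pick a bucket size for the selected time range (illustrative thresholds)."""
    if time_range <= timedelta(hours=4):
        return timedelta(minutes=1)
    if time_range <= timedelta(days=1):
        return timedelta(minutes=15)
    if time_range <= timedelta(days=7):
        return timedelta(hours=1)
    return timedelta(days=1)


def to_tpm(count_in_bucket: int, span: timedelta) -> float:
    """Convert a per-bucket count to transactions per minute."""
    return count_in_bucket / (span.total_seconds() / 60)


span = choose_span(timedelta(days=7))
print(span)                # 1:00:00
print(to_tpm(1200, span))  # 20.0
```

Because every value is expressed per minute, the maximum on the y-axis no longer changes just because the bucket size did.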
47. Time Range Selector Without “All Time”
• Time range selectors usually allow the definition
of “custom time“, which might result in very
long-running searches.
• Using form elements we have created our own time
range selector, where the user can only select from
predefined time ranges – no more custom time.
48. Splunking Ticketing Systems
• After we had sent all our application and database information to
Splunk, we started to look for other data sources.
• Reporting on ticketing systems like HP Quality Center or BMC
Remedy took a lot of manual effort in the past.
• Ticket information is a bit different in terms of Splunking.
– Usually we have multiple records for a ticket over its lifetime.
– Ticket records may carry several timestamps, e.g. to identify different
statuses.
– Each and every record for one single ticket may be very important.
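Because a ticket appears as several records, a report on current status has to pick the latest record per ticket while the earlier ones remain available for history. A minimal Python sketch of that selection follows; the field names `ticket_id`, `status` and `modified` are assumptions.

```python
from datetime import datetime
from typing import Dict, List


def latest_per_ticket(records: List[dict]) -> Dict[str, dict]:
    """Return the most recently modified record for each ticket ID."""
    latest: Dict[str, dict] = {}
    for rec in records:
        current = latest.get(rec["ticket_id"])
        if current is None or rec["modified"] > current["modified"]:
            latest[rec["ticket_id"]] = rec
    return latest


# Hypothetical ticket history: T-1 was opened and later closed.
records = [
    {"ticket_id": "T-1", "status": "open", "modified": datetime(2013, 3, 1, 9, 0)},
    {"ticket_id": "T-1", "status": "closed", "modified": datetime(2013, 3, 4, 16, 30)},
    {"ticket_id": "T-2", "status": "open", "modified": datetime(2013, 3, 2, 11, 0)},
]

latest = latest_per_ticket(records)
print(latest["T-1"]["status"])  # closed
print(latest["T-2"]["status"])  # open
```

The full list of records is left untouched, so trend charts can still use every status change.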
49. Simple Revenue Trends Chart
• As mentioned before, we use Splunk to provide
information to different audiences, for example
management.
• Since management is more interested in figures
about business processes than technical processes, we
can easily provide charts showing revenue trends.
• But the information about financial transactions can also
be helpful for users with a technical focus, such as on-
call engineers.
50. Compare Results with Previous Weeks
• The 3rd layer in the background is rendered as an
area chart.
• This area shows transactions with errors only.
• These are background charts, which are only visible if
there are more errors than successful transactions.