Vodafone and Splunk Success Story

Here is some interesting material on what Vodafone is doing with Splunk.

This one in particular was presented at .conf2013, Splunk's worldwide conference, and .conf2014 will take place in October this year. Plan ahead and take part; it is worth every cent!

As a reminder, registration for .conf2014 is already open at a promotional price.

More information here: http://conf.splunk.com/?r=homepage

Published in: Technology

  1. Advanced Splunk Dashboards in Operations and Support
     Norbert Hamel, Vodafone Group - Emerging Technologies Deployment & Support
     #splunkconf
  2. Information Manager & Data Analyst
     • Norbert holds a Master’s Degree in Mechanical Engineering from the RWTH Aachen/Germany.
     • He has been working in the IT industry for nearly 20 years, initially in Marketing and Technical Writing.
     • With Vodafone for the last 4 years, he is involved in Reporting and ultimately built up the Splunk infrastructure in his department, analyzing data from 1000+ virtual servers/machines and 90 databases with 20 Splunk instances.
     • His focus in Splunk is the creation of sophisticated dashboards which present valuable information in an easy-to-use manner. Various audience groups, including 24/7 Monitoring, Operations & Support as well as Management, use these dashboards to gain personal insight into complex technical processes.
  3. Agenda
     • Vodafone Group Services Emerging Technologies Deployment & Support – What We Do
     • Why We Are Using Splunk
     • Splunk Infrastructure
     • Splunk Dashboards – A Journey Over 2 Years
     • Splunk Enterprise 6 – Our Next Steps
     • Splunk Enterprise 6 Live Demo – What We Have Tested
     • Summary
  4. Vodafone Group Services Emerging Technologies Deployment & Support: What We Do
  5. Use Case: Carrier Billing
     • Today we are using our smartphones more and more for purchasing different kinds of digital goods.
     • One common use case is to buy apps, music or subscriptions to various services, where the purchased good is used directly on the smartphone.
     • Usually the mobile user expects to find these purchases on his monthly bill from his mobile network carrier. Alternatively, if the user has a prepaid contract, the purchase should be taken from his balance.
     • The team presenting its Splunk use case at the .conf is called Deployment & Support, which is part of Vodafone’s Emerging Technologies department.
     • The team is responsible for operating a complex platform for carrier billing in 20 Vodafone companies and partner markets. For the sake of simplicity we will call this the “charging platform”.
     [Diagram: Charging Platform]
  6. Sounds Easy?
     [Diagram: Charging Platform]
  7. There is a Bit More Behind the Scenes ...
     • To complete the cycle of purchasing, some questions have to be answered:
       – Who is the mobile customer?
       – Is the customer using a mobile device or a PC?
       – Is this a prepaid or postpaid customer?
       – To which local market is the customer registered?
       – Which business partner is providing the purchased digital goods?
       – Which business partner takes a share of the revenue?
       – Which additional systems need to be informed about the purchase?
     • Ultimately the overall cycle results in several requests and responses between the mobile customer’s device and multiple server systems.
     • Currently the Deployment & Support team operates 570 virtual servers/machines and approx. 40 databases for this purpose in production and test environments.
  8. Several Entities Are Involved or Affected Somehow
     [Diagram labels: Charging Platform, Customer Care, Pricing, Billing, 24x7 Monitoring, Development, Deployment, Management, Business Partner]
  9. Why We Are Using Splunk
  10. Why We Are Using Splunk
     • Before Splunk: Cacti, RRDtool, tail, log files ...
     • Have all data from different IT systems available in one single environment and in (nearly) real time.
     • Correlation of data from different sources.
     • Easy-to-use interface for any non-technical audience in multiple user groups.
  11. Splunk Infrastructure
  12. [Diagram labels: Forwarding, Indexing, Syslog, UFs, Scripts, Searching / dashboards, Data input, Queues, DB]
  13. Data Sources
     • Apache
     • JBoss application
     • SQL databases
     • HornetQ
     • Tibco EMS
     • Remedy BMC
     • HP Quality Center
     • HP OpenView
     • IBM DataPower
     • Business Objects
     • Pentaho
     • Excel
     • Soon: Hadoop
  14. Splunk Dashboards: A Journey Over 2 Years
  15. Splunking Server Applications and Databases
  16. Requests and Response Times
     • This is where we started from: showing requests and response times from a group of servers which perform the same tasks.
     [Charts: Average Response Times, Amount of Requests]
  17. Simple KPI Dashboard – Set Color Based on Values
     • Our KPI dashboards show the performance of different services.
     • Since we have SLA targets of nearly 100%, it’s sometimes hard to see whether the target was met or breached.
     • We decided to use a different color for the columns if the target was breached in a certain time range.
     • Since we could not directly assign a color to a certain value, we generated 2 rows of results: one holds the good values, shown in green, and the other holds the breached values, shown in red.
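The two-row approach from the KPI dashboard can be sketched outside Splunk as follows. This is a minimal illustration in Python, not the search used in the talk; the function name `split_by_target` and the sample values are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def split_by_target(
    values: List[float], target: float
) -> Tuple[List[Optional[float]], List[Optional[float]]]:
    """Split one KPI series into a 'met' series and a 'breached' series.

    Both outputs keep the length of the input. A position that belongs to
    the other series holds None, so a chart can leave it empty and give
    each series its own color.
    """
    met = [v if v >= target else None for v in values]
    breached = [v if v < target else None for v in values]
    return met, breached


# Hypothetical availability figures in percent, checked against a 99.9% target.
availability = [99.99, 99.95, 99.20, 100.0]
met, breached = split_by_target(availability, target=99.9)
```

Plotting `met` in green and `breached` in red on the same axis gives one visible column per time range, colored by whether the target was met.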
  18. Set Colors by Value for Line Charts
     • Some post-processing jobs are supposed to finish within a time range of 2 hours (120 minutes).
     • This chart shows the actual processing time for a certain part of the process.
     • Processing times below the limit are in green, above in red.
     • In case the processing takes more than 10 hours, the chart will cut the line in black.
     • Similar to the KPI dashboard, this is realized using multiple rows of results which are layered on top of each other.
  19. Correlation of Amount of Requests and Response Times
     • The next step: combine amount of requests and response times into a single chart.
     • Amounts are rendered as a column chart, response times as line charts.
     • Each chart uses its own y-axis scaling.
     • Additional gauge chart for real-time view.
  20. Showing Maintenance
     • If you have a monitoring team watching your dashboards, it might be helpful to inform them about maintenance periods.
     • Should unusual charts show up during maintenance, the monitoring team can immediately see that this might be related to planned maintenance, and act accordingly.
     • The maintenance graph can be realized with one single event stating the start and end time, e.g. from one database record.
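As a rough sketch of how a single maintenance record can become a chart overlay, the Python below expands one start/end pair into a 0/1 flag per time bucket. The function name, the 15-minute buckets and the dates are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import List


def maintenance_series(
    start: datetime, end: datetime, buckets: List[datetime]
) -> List[int]:
    """Return 1 for every bucket that starts inside [start, end), else 0."""
    return [1 if start <= bucket < end else 0 for bucket in buckets]


# Hypothetical chart buckets: eight 15-minute slots starting at midnight.
buckets = [datetime(2013, 10, 1, 0, 0) + timedelta(minutes=15 * i) for i in range(8)]

# One maintenance record, e.g. read from a database row.
flags = maintenance_series(
    start=datetime(2013, 10, 1, 0, 30),
    end=datetime(2013, 10, 1, 1, 15),
    buckets=buckets,
)
```

The resulting series can be drawn as a shaded area behind the regular charts so that the monitoring team sees the maintenance window at a glance.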
  21. Combine Summary Indexing with Drilldown to Live Search
     • Sometimes you might find gaps in charts built on summary indexes. The charts load fast, but they are incomplete.
     • In this case we can drill down to another version of the same chart, which uses the live index instead of the summary.
     [Charts: Summary index, Live index]
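One way to decide when a drilldown to the live index is worth offering is to check the summarized buckets for gaps. The sketch below is an assumption about how such a check could look in Python; it is not taken from the talk, and `find_gaps` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def find_gaps(
    bucket_times: Iterable[datetime],
    start: datetime,
    end: datetime,
    span: timedelta,
) -> List[datetime]:
    """List the expected bucket start times in [start, end) that are missing."""
    present = set(bucket_times)
    missing = []
    current = start
    while current < end:
        if current not in present:
            missing.append(current)
        current += span
    return missing


# Hypothetical summary buckets for one hour; the 00:30 bucket was never written.
summarized = [
    datetime(2013, 10, 1, 0, 0),
    datetime(2013, 10, 1, 0, 15),
    datetime(2013, 10, 1, 0, 45),
]
gaps = find_gaps(
    summarized,
    start=datetime(2013, 10, 1, 0, 0),
    end=datetime(2013, 10, 1, 1, 0),
    span=timedelta(minutes=15),
)
```

If `gaps` is not empty, the dashboard could point the user at the slower but complete live-index version of the chart.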
  22. Regression Test with Different Software Versions
     • Run regression tests with different versions of a software or different settings.
     • The dashboard automatically finds all the different runs and fills them into drop-down lists, regardless of when the test run was executed.
     • Easily compare selected runs over a certain time frame which might show significant values.
  23. Splunking Business Processes
  24. Compare Results with Previous Weeks
     • This chart shows the amount of purchase transactions in the charging platform for a certain selection of local market, business partner or other characteristic attributes.
     • The amount of transactions is displayed as a column chart, for example split by product.
     • The overlay shows comparative values for the same time range from 3 weeks before.
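The overlay idea, shifting an older series forward so that it lines up with the current one, can be sketched in Python as below. The function names and sample counts are hypothetical, and the sketch assumes "3 weeks before" means a single series offset by 21 days; the `weeks` parameter covers other readings.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


def shift_series(series: Dict[datetime, int], weeks: int) -> Dict[datetime, int]:
    """Move every timestamp forward by the given number of weeks."""
    offset = timedelta(weeks=weeks)
    return {ts + offset: count for ts, count in series.items()}


def align(
    current: Dict[datetime, int], previous: Dict[datetime, int], weeks: int
) -> Dict[datetime, Tuple[int, Optional[int]]]:
    """Pair each current value with the value from `weeks` weeks earlier.

    The second element is None when no earlier value exists for that bucket.
    """
    shifted = shift_series(previous, weeks)
    return {ts: (count, shifted.get(ts)) for ts, count in current.items()}


# Hypothetical hourly transaction counts.
current = {datetime(2013, 10, 22, 10): 120, datetime(2013, 10, 22, 11): 140}
previous = {datetime(2013, 10, 1, 10): 100}
overlay = align(current, previous, weeks=3)
```

Each entry of `overlay` holds the current count and the comparison value for the same weekday and hour, which is what the column chart and its overlay line display.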
  25. Compare Results with Previous Weeks
     • In this situation something is different from previous weeks.
     • We can see a significant increase of transactions, and all of those are coming from the items marked in yellow.
     • The 3rd layer in the background is rendered as an area chart and shows transactions with errors only.
  26. Revenue Loss Calculation
     • As mentioned in the introduction, the charging platform supports business cases to generate revenue from purchasing processes.
     • On the other hand, this means that an outage in the charging platform may impact the revenue and potentially lead to revenue loss.
     • The revenue loss calculator is a tool that helps people involved in technical issues to quickly detect the potential financial impact of an outage and take the appropriate actions.
  27. Splunking Ticketing Systems
  28. Ticket History
     • Using Splunk to create reports about ticketing systems enables us to make the related information available to a wide audience.
     • Here each “team” can identify within seconds how many tickets are open and their status. They can view a list of details as well.
     • The history shows the trend of tickets in “open” status.
  29. Scatter Chart with Transparency
     • Scatter charts can help to identify significant patterns in the relation between 2 attributes.
     • In this case we measure the resolution time of software defects within a development team over time.
     • Before May we encountered “bug closing parties” on certain days, resulting in high peak values.
     • After changing the processes, the team is now working continuously on defects, resulting in lower resolution times.
  30. Writing Data to Splunk to Sort Out Incorrect Tickets
     • We also use Splunk to report SLA-related information.
     • Sometimes there might be tickets which are assigned to our team by mistake.
     • Since we don’t want to have those tickets in our SLA report, we created a dashboard where users can identify the tickets and generate comments on them.
     • The comments are then written to a lookup table, which is used as a filter for showing only relevant tickets in SLA reports.
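The lookup-as-filter pattern can be illustrated with a small Python sketch: comments are stored in a CSV-style table, and every ticket that has a comment is left out of the SLA report. The column names `ticket_id` and `comment`, the ticket IDs and both helper functions are assumptions for the example.

```python
import csv
import io
from typing import Dict, List, Set


def load_excluded_ids(lookup_csv: str) -> Set[str]:
    """Read the comment lookup and return the IDs of all commented tickets."""
    reader = csv.DictReader(io.StringIO(lookup_csv))
    return {row["ticket_id"] for row in reader}


def filter_tickets(tickets: List[Dict[str, str]], excluded: Set[str]) -> List[Dict[str, str]]:
    """Keep only tickets that nobody has flagged as wrongly assigned."""
    return [ticket for ticket in tickets if ticket["ticket_id"] not in excluded]


# Hypothetical lookup content written by the dashboard.
lookup = "ticket_id,comment\nINC-2,assigned to the wrong team\n"

tickets = [{"ticket_id": "INC-1"}, {"ticket_id": "INC-2"}, {"ticket_id": "INC-3"}]
relevant = filter_tickets(tickets, load_excluded_ids(lookup))
```

Keeping the exclusions in a separate table means the raw ticket data stays untouched and a wrongly flagged ticket can be restored by deleting one row.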
  31. Splunk Enterprise 6: Our Next Steps
  32. What Will We Get from Splunk?
     • Overlay charts without Flash.
     • Interactive dashboards with forms in simple XML.
     • Data models to provide customised information access.
  33. Splunk Enterprise 6 Live Demo: What We Have Tested
  34. Summary
  35. Sophisticated Dashboards with Splunk
     • Using Splunk you can create sophisticated dashboards which meet the requirements of various audience groups, including monitoring teams, real techies, as well as upper management.
     • Splunk dashboards can cover technical processes as well as business cases and organizational processes.
     • Splunk Enterprise 6 will bring most of the functionality required for sophisticated dashboarding to a level which can be used by a wider range of users.
  36. Next Steps
     • Download the .conf2013 Mobile App. If not iPhone, iPad or Android, use the Web App.
     • Take the survey & WIN A PASS FOR .CONF2014… Or one of these bags!
     • View all “What’s New” presentations. PPTs on the .conf2013 Mobile App. Recordings will be available shortly.
  37. Thank You
  38. Backup
  39. The Ultimate Log File Format ULFF
     • After the first steps with Splunk, we found that several different log file formats may lead to errors or at least confusion, e.g. if timestamps or return codes do not share the same format.
     • The resulting issues can be solved within Splunk, but this requires additional configuration and processing power.
     • Finally we decided to define a new ultimate log file format (ULFF), which is being implemented in all applications feeding the Charging Platform. ULFF is a custom JSON format.
     • {"transaction-id":"1234-5678-9012", "usecase-id":"9876-5432-1098", "timestamp":"2013-03-10T20:24:25,123+01:00", "country-code":"GB", "status":"ok", "error":"", "payload":"<xml attr=\"value\"></xml>"}
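Because ULFF is JSON with a fixed timestamp layout, a consumer can parse every line the same way. The Python below is a minimal sketch built only on the sample record from the slide; `parse_ulff` and `ULFF_TIME_FORMAT` are hypothetical names, and the inner quotes of the payload are escaped here so that the line is valid JSON.

```python
import json
from datetime import datetime

# Timestamp layout taken from the sample record: comma before the
# milliseconds and a UTC offset with a colon.
ULFF_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S,%f%z"


def parse_ulff(line: str) -> dict:
    """Decode one ULFF line and turn its timestamp into a datetime."""
    event = json.loads(line)
    event["timestamp"] = datetime.strptime(event["timestamp"], ULFF_TIME_FORMAT)
    return event


sample = (
    '{"transaction-id":"1234-5678-9012", "usecase-id":"9876-5432-1098", '
    '"timestamp":"2013-03-10T20:24:25,123+01:00", "country-code":"GB", '
    '"status":"ok", "error":"", "payload":"<xml attr=\\"value\\"></xml>"}'
)
event = parse_ulff(sample)
```

With one shared format, the same few lines work for every application feeding the platform, which is the point of agreeing on the format up front instead of normalizing at search time.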
  40. Some Examples of Complexity: Prepaid or Postpaid
     • In case a customer has a postpaid contract, the charging platform sends information to the customer’s local Vodafone market, and they will put the purchase on the customer’s bill.
     • In case the customer is a prepaid customer, the charging platform performs several additional steps:
       1. Check if the customer’s balance is sufficient for the desired purchase.
       2. Somehow “block” the amount required to purchase the item.
       3. Provide the desired item.
       4. Finally take the amount from the “blocked” balance.
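The four prepaid steps follow a reserve-then-capture pattern, sketched below in Python. The class, its method names and the use of integer cents are assumptions for illustration, and the release of a blocked amount after a failed delivery is an added assumption that the slide does not mention.

```python
from typing import Callable


class PrepaidAccount:
    """Toy prepaid balance in integer cents with a blocked (reserved) part."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.blocked = 0

    def available(self) -> int:
        return self.balance - self.blocked

    def reserve(self, amount: int) -> bool:
        """Steps 1 and 2: check the balance and block the amount."""
        if amount > self.available():
            return False
        self.blocked += amount
        return True

    def capture(self, amount: int) -> None:
        """Step 4: take the amount from the blocked balance."""
        self.blocked -= amount
        self.balance -= amount

    def release(self, amount: int) -> None:
        """Assumed extra step: unblock the amount if delivery failed."""
        self.blocked -= amount


def purchase(account: PrepaidAccount, price: int, deliver: Callable[[], bool]) -> bool:
    """Run the prepaid flow; `deliver` stands in for step 3."""
    if not account.reserve(price):
        return False
    if deliver():
        account.capture(price)
        return True
    account.release(price)
    return False
```

Blocking before delivery ensures that two purchases running in parallel cannot both spend the same balance.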
  41. Some Examples of Complexity: Cellular or Wireless LAN Connection
     • If the customer has a connection to his mobile carrier’s network when starting a purchase cycle, the authentication can be established using the MSISDN from the customer’s SIM card.
     • If the customer is located in an area without mobile network coverage but with WLAN connectivity only, there is no MSISDN communicated for the charging platform to refer to. In this case the user may install an app on his mobile device which is able to provide the MSISDN via wireless LAN connectivity as an alternative.
     • If the customer is not using a mobile device at all, but a PC instead, there is no way to provide the MSISDN automatically. In this case the user may manually enter the MSISDN which is to be charged for the purchase. For authentication, the charging platform could send an SMS to this MSISDN providing a one-time authorization code for the purchase.
  42. Showing Long Term Trends
     • The overlay technique is a good tool for operation engineers or monitoring teams who need to observe the current situation of server systems.
     • But those charts can also be used to get a better insight into long-term processes.
  43. Database Performance
     • Before using Splunk, we took this information from RRDtool.
     • Now we monitor physical IO values (reads, writes, redos), waits, sessions and CPU or disk usage directly in Splunk.
  44. Correlation of Amount of Requests and Response Times
     • The next step: combine the amount of requests and response times into a single chart.
     • Amounts are rendered as a column chart, response times as line charts.
     • Each chart uses its own y-axis scaling.
  45. Monitoring Message Queues
     • Real-time gauge charts to monitor the amount of messages processed in Java messaging queues.
     • In parallel, get a more detailed view over the last 60 minutes comparing the amount of requests with processing time.
  46. Spanning Based on Selected Time Range
     • In time charts the spanning is either set to a fixed value or calculated automatically.
     • Fixed spanning results in too many measuring points for long time ranges; auto mode results in different max values.
     • We have implemented time charts where the spanning is adjusted to the time range, and values are automatically converted to comparable transactions per minute (TPM).
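A sketch of the two parts of this idea, choosing a bucket span from the selected time range and normalizing each bucket count to transactions per minute, could look like this in Python. The thresholds in `choose_span` are invented for the example; the talk does not state which spans were used.

```python
from datetime import timedelta


def choose_span(time_range: timedelta) -> timedelta:
    """Pick a bucket size that keeps the number of points manageable.

    The thresholds are illustrative assumptions, not values from the talk.
    """
    if time_range <= timedelta(hours=4):
        return timedelta(minutes=1)
    if time_range <= timedelta(days=1):
        return timedelta(minutes=15)
    if time_range <= timedelta(days=7):
        return timedelta(hours=1)
    return timedelta(days=1)


def to_tpm(count_in_bucket: int, span: timedelta) -> float:
    """Convert a per-bucket count into transactions per minute."""
    return count_in_bucket / (span.total_seconds() / 60)
```

For example, 900 transactions in a 15-minute bucket and 3600 transactions in a 1-hour bucket both come out as 60 TPM, so charts over different time ranges share the same y-axis scale.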
  47. Time Range Selector Without “All Time”
     • Time range selectors usually allow the definition of “custom time”, which might result in very long-running searches.
     • Using form elements we have created our own time range selector, where the user can only select from predefined time ranges, so there is no more custom time.
  48. Splunking Ticketing Systems
     • After we had sent all our application and database information to Splunk, we started to look for other data sources.
     • Reporting about ticketing systems like HP Quality Center or Remedy BMC took a lot of manual effort in the past.
     • Ticket information is a bit different in terms of Splunking.
       – Usually we have multiple records for a ticket over its lifetime.
       – Ticket records may carry several timestamps, e.g. to identify different statuses.
       – Each and every record for one single ticket may be very important.
  49. Simple Revenue Trends Chart
     • As mentioned before, we use Splunk to provide information to different audiences, for example the management.
     • Since the management is more interested in figures about business processes than technical processes, we can easily provide charts showing revenue trends.
     • But the information about financial transactions can also be helpful for users with a technical focus, such as on-call engineers.
  50. Compare Results with Previous Weeks
     • The 3rd layer in the background is rendered as an area chart.
     • This area shows transactions with errors only.
     • These are background charts, which are only visible if there are more errors than successful transactions.
