Splunk .conf2011: Real Time Alerting and Monitoring

Published in: Technology, Design

  1. Monitoring and Alerting
       Ledion Bitincka, Search and Alerting Team
  2. Intro … Ledion Bitincka (aka Splunk Albanian)
       • Search and Reporting Team
       • @ Splunk for 4+ years - since 3.0
       • Things I’ve worked on:
           • Key-value extractions
           • Transactions, Eventtyping, Typeahead, Summary Indexing
           • Monitoring and alerting framework
           • Other random @#$%
  3. Agenda
       • Why use Splunk for monitoring and alerting?
       • Basic alerting
       • Advanced alerts and config options
       • Real-time alerting and throttling (new in 4.2)
       • Alert Manager (new in 4.2)
       • Sneak peek into new features …
       • Feel free to interrupt when you don’t follow!!!
  4. Life Without Splunk
       Roles involved: Service Desk, Application Support, Systems Administrator, Application Developer, Application Developer, Database Administrator
       • Log call.
       • The console says everything is green. App monitoring tools don’t show anything either.
       • Call the developer.
       • Stop working on new code to troubleshoot. Need production logs!
       • Stop what they’re doing to identify and gather production logs for developer.
       • Manual investigation establishes that it is not an application problem.
       • DBA analyzes the logs, which point to corrupted database files.
       Between steps: Escalate. Escalate. Escalate. Respond. Escalate. Now what?
  5. Life With Splunk
       • Service Desk: Trouble Ticket
       • Search on IP address (Last 60 minutes; search: 192.168.169.100): shows related Web session and User ID
       • Search for failure or error across entire IT (Last 2 minutes; search: failure OR error): search at same time reveals database error due to corrupted files
       • Search for corruption in db logs (Last 1 minute; search: host=db.domain.com source=*db.log corrupt*): shows that an index file has been corrupted
       • Set up Monitoring and Alerting (Last hour; search: host=db.domain.com source=*db.log corrupt*): set up monitoring and alerting for db file corruption
  6. One Splunk. Many Uses.
  7. Monitor and Alert in Real Time
       • 1. Get data: scheduled search or real-time search
       • 2. Evaluate alerting condition
       • 3. Execute actions
           • Condition met (Yes): RSS, Email, SNMP, Script
           • Condition not met (No): no-op
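The three-step flow on this slide (get data, evaluate the alerting condition, execute actions) can be sketched in a few lines of Python. This is an illustration only, not Splunk's implementation: the `Alert` class, `run_once`, and `fake_email` are hypothetical names, and the search is replaced by a function that returns canned events.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class Alert:
    """Hypothetical model of one alert: a data source, a condition, and actions."""
    name: str
    get_data: Callable[[], List[Event]]          # step 1: stands in for a search
    condition: Callable[[List[Event]], bool]     # step 2: alerting condition
    actions: List[Callable[[str, List[Event]], None]] = field(default_factory=list)

    def run_once(self) -> bool:
        results = self.get_data()
        if not self.condition(results):
            return False                         # "No" branch: no-op
        for action in self.actions:              # step 3: execute every action
            action(self.name, results)
        return True


sent: List[str] = []


def fake_email(alert_name: str, results: List[Event]) -> None:
    # Stand-in for an email action; it only records what would be sent.
    sent.append(f"{alert_name}: {len(results)} event(s)")


alert = Alert(
    name="db corruption",
    get_data=lambda: [{"host": "db.domain.com", "_raw": "index file corrupted"}],
    condition=lambda results: len(results) > 0,
    actions=[fake_email],
)
fired = alert.run_once()
```

Keeping the condition and the actions as plain callables mirrors the slide's separation between deciding whether to alert and deciding what to do about it.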
  8. Demo… (5-10 min)
       • Create simple alert using wizard …
       • Available alert actions …
       • Configure email settings (MTA, link hostname)...
  9. Advanced Alerting Options
       • Specify an advanced schedule using cron notation
       • Use custom alert conditions
       • Invoke scripts to perform custom actions
           • Integrate with other tools
               • file trouble ticket
           • Other custom processing
               • restart a faulty service
               • update a firewall rule
               • temporarily disable a user account
               • etc …
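As a rough sketch of the "invoke scripts to perform custom actions" bullet, the Python below turns the arguments passed to an alert script into a trouble-ticket payload. The positional argument order in `ARG_NAMES` is how I recall Splunk's scripted-alert documentation describing it and should be checked against your version; `parse_alert_args`, `build_ticket`, the example argument values, and the ticket fields are all hypothetical. A real script would read `sys.argv` and call a ticketing system instead of building a dict.

```python
from typing import Dict, List

# Assumed positional order of the arguments Splunk hands to an alert script.
# Verify against the scripted-alert documentation for your Splunk version.
ARG_NAMES = [
    "script_name",        # argv[0]
    "event_count",        # argv[1]
    "search_terms",       # argv[2]
    "query_string",       # argv[3]
    "saved_search_name",  # argv[4]
    "trigger_reason",     # argv[5]
    "results_url",        # argv[6]
    "deprecated",         # argv[7]
    "results_file",       # argv[8]
]


def parse_alert_args(argv: List[str]) -> Dict[str, str]:
    """Pair each positional argument with a readable name."""
    return dict(zip(ARG_NAMES, argv))


def build_ticket(args: Dict[str, str]) -> Dict[str, str]:
    """Build a hypothetical ticket payload; no ticketing system is contacted."""
    return {
        "title": f"Splunk alert: {args.get('saved_search_name', 'unknown')}",
        "body": f"{args.get('trigger_reason', '')}\nResults: {args.get('results_url', '')}",
    }


# Made-up arguments for illustration; in a real script this would be sys.argv.
example_argv = [
    "file_ticket.py", "3", "corrupt*",
    "search host=db.domain.com source=*db.log corrupt*",
    "DB corruption", "Number of events > 0",
    "http://splunk.example.com/view", "", "/tmp/results.csv.gz",
]
ticket = build_ticket(parse_alert_args(example_argv))
```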
  10. Real-time Search Primer
       • Searches forward in time
       • Never completes (unless stopped)
           • Constantly updating result set
           • Only generates results preview
       • All search commands supported
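One way to picture a search that runs forward in time and only ever produces previews is a rolling window over an event stream. The sketch below is a conceptual model, not how Splunk implements real-time search: `rolling_window` is a hypothetical helper, events are `(timestamp, text)` pairs, and they are assumed to arrive in time order.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple


def rolling_window(
    events: Iterable[Tuple[float, str]], window_secs: float
) -> Iterator[List[str]]:
    """Yield a preview of the current window after every incoming event."""
    window: Deque[Tuple[float, str]] = deque()
    for ts, raw in events:
        window.append((ts, raw))
        # Drop events that have aged out of the window.
        while window and window[0][0] <= ts - window_secs:
            window.popleft()
        yield [text for _, text in window]


# Three events with a 60-second window; the first has aged out by t=70.
stream = [(0.0, "a"), (30.0, "b"), (70.0, "c")]
previews = list(rolling_window(stream, window_secs=60))
# previews == [["a"], ["a", "b"], ["b", "c"]]
```

With an endless input stream the generator never finishes, which matches the slide's point that a real-time search never completes unless it is stopped.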
  11. Scheduled Search Alerts
       [Flow diagram with two lanes, Splunkd/Scheduler and Search Process, along a time axis]
       • Diagram steps: Start historical search; Search; Search done; Notify splunkd; Condition on Results (Y/N); Suppress? (Y/N); Done
       • When the alert fires:
           • Execute actions
           • Update artifact TTL
           • Suppression update
           • Alert manager
       • Logging: audit.log, search.log, splunkd_access.log, scheduler.log
  12. Real-time Alerts
       [Flow diagram with two lanes, Splunkd/Scheduler and Search Process, along a time axis]
       • Diagram steps: Start RT search; RT Search; Notify splunkd; Condition on ResPrev (Y/N), shown twice; Suppress? (Y/N); Results Snapshot; Done
       • When the alert fires:
           • Execute actions
           • Update artifact TTL
           • Suppression update
           • Alert manager
       • Logging: audit.log, search.log, splunkd_access.log, scheduler.log
  13. Real-time Alerts
       • Reduce response time
       • Continuously monitor a condition
       • Scheduler ensures real-time search is always running
       • Throttling is almost always necessary
       • Compatible with all alert actions
       • Visible through Alerts Manager
  14. Alert Throttling
       • Alert action throttling is supported natively
       • Example use:
           • Alert when database server is down, but don’t alert me about this condition again for one hour
       • Available for both standard and real-time alerts
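The one-hour example on this slide amounts to remembering when the alert last fired and suppressing it until the period has elapsed. A minimal sketch, assuming timestamps in seconds; the `Throttle` class and `should_fire` are hypothetical names and do not describe Splunk's internal suppression logic.

```python
from typing import Optional


class Throttle:
    """Suppress repeat firings of one alert for a fixed period."""

    def __init__(self, period_secs: float) -> None:
        self.period_secs = period_secs
        self.last_fired: Optional[float] = None

    def should_fire(self, now: float) -> bool:
        # Suppress while still inside the period that began at the last firing.
        if self.last_fired is not None and now - self.last_fired < self.period_secs:
            return False
        self.last_fired = now
        return True


# "Database server is down" is detected at these times (seconds).
throttle = Throttle(period_secs=3600)
decisions = [throttle.should_fire(t) for t in (0, 600, 3599, 3600, 4000)]
# decisions == [True, False, False, True, False]
```

Note that a suppressed detection does not restart the period here; whether the period should restart on every match is a design choice this sketch does not settle.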
  15. Alerts Manager
       • System-wide view of all triggered alerts
       • Basic alert management features
       • Ability to drill down and view why the alert was triggered
       • Real-time alert results are snapshots in time when triggered
  16. Demo… (5-10 min)
       • Show custom alert conditions, when to use them
       • Demo real-time alerts:
           • Throttling
           • Alert manager
  17. Managing Search Load
       • System-wide concurrent searches are limited to
           • Total: 4 + 4 x number of cores
           • Limit applies to both ad-hoc and scheduled searches
       • Scheduler queues searches that exceed the limit
       • Scheduler allocation is configurable in limits.conf
           • [scheduler]
           • max_searches_perc = 25 (percentage of system-wide concurrent searches to use)
           • …
       • Use the Scheduler Activity dashboards
       • Search App >> Status >> Scheduler Activity >> Overview
       [Figure: Search allocation]
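The limits on this slide are simple arithmetic, worked through below for an 8-core host. The formula and the 25 percent default come from the slide; the function names are hypothetical, and rounding the scheduler's share down to a whole number is my assumption, not something the slide states.

```python
def max_concurrent_searches(cores: int, base: int = 4, per_core: int = 4) -> int:
    """System-wide limit from the slide: 4 + 4 x number of cores."""
    return base + per_core * cores


def scheduler_share(total: int, max_searches_perc: int = 25) -> int:
    """Scheduler's portion of the system-wide limit.

    Rounding down is an assumption; the slide does not say how fractions are handled.
    """
    return total * max_searches_perc // 100


total = max_concurrent_searches(cores=8)   # 4 + 4 * 8 = 36
scheduled = scheduler_share(total)         # 25% of 36 = 9
```

On that host the scheduler would run at most 9 searches at once under the default, with further scheduled searches queued until a slot frees up.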
  18. .conf & .log File Summary
       • savedsearches.conf
           • Search string, schedule, alert condition, actions etc…
       • alert_actions.conf
           • Alert action options such as: email server, format, subject line, ttl etc…
       • limits.conf
           • Scheduler’s concurrent search limit
           • Action execution related limits
       • scheduler.log
       • Look in $SPLUNK_HOME/etc/system/README/<filename>.conf.spec for more detailed info
  19. Sneak Peek Into New Features
       • Per-result alerting and throttling
       • More alert actions to enable more complex alerting conditions
           • Once five failed login attempts occur, enable a monitoring search that alerts on suspicious user activity
  20. Now You Should Know …
       • How scheduled and real-time alerts work
       • Create simple and advanced real-time alerts
       • Enable alert throttling and check for throttled alerts
       • Check fired alerts using Alerts Manager
       • Change scheduler limit defaults
       • Be an IT hero
  21. Questions?
       Ledion Bitincka, Search and Alerting Team
       August 15, 2011
