Copyright © 2014 Splunk Inc.
Splunk at CSAA Insurance Group
Doug Erkkila
Capacity Management Analyst
PAS Team, CSAA Insurance Group
About CSAA Insurance Group
• Insurance company offering automobile, homeowners and other personal lines of insurance to AAA members through AAA clubs
• More than 3600 employees coast to coast
• Reaching nearly 17 million AAA members in 23 states and Washington DC
PAS Team Charter
• PAS = Policy Administration System
• Total team = 300, optimization team = 5
• Maintain and manage PAS application
performance, capacity and availability
• Consolidate policy administration from across
different insurance categories and across different
states into one central system
PAS Team Accountability
• KPIs
• Average response time (90th percentile)
• Overall total availability
• Number of users per session
• Application Errors (Error 500s)
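The slides do not say how these KPIs are computed, so the following is only a minimal sketch of three of them. It assumes a nearest-rank 90th percentile, availability as the share of time the system was up, and Error 500s counted from HTTP status codes. The class and method names (`KpiSketch`, `percentile90`, `availabilityPercent`, `countError500`) are hypothetical and are not taken from the PAS team's tooling.

```java
import java.util.Arrays;

// Illustrative only: names and formulas are assumptions, not the PAS team's code.
class KpiSketch {

    // Nearest-rank 90th percentile: the smallest sample such that at least
    // 90% of all samples are less than or equal to it.
    static double percentile90(double[] responseTimesMs) {
        if (responseTimesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = responseTimesMs.clone();
        Arrays.sort(sorted);
        // Integer form of ceil(0.9 * n), which avoids floating-point rounding.
        int rank = (9 * sorted.length + 9) / 10; // 1-based rank
        return sorted[rank - 1];
    }

    // Availability as the percentage of the period the system was up.
    static double availabilityPercent(long upSeconds, long totalSeconds) {
        if (totalSeconds <= 0) {
            throw new IllegalArgumentException("totalSeconds must be positive");
        }
        return 100.0 * upSeconds / totalSeconds;
    }

    // Number of responses with HTTP status 500.
    static long countError500(int[] statusCodes) {
        return Arrays.stream(statusCodes).filter(code -> code == 500).count();
    }

    public static void main(String[] args) {
        double[] times = {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000};
        System.out.println("p90 response time (ms): " + percentile90(times));
        System.out.println("availability (%): " + availabilityPercent(99, 100));
        System.out.println("error 500s: " + countError500(new int[] {200, 500, 404, 500}));
    }
}
```

A percentile is used here instead of a plain mean because a few very slow requests would otherwise dominate the response-time figure.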
Splunk at CSAA Insurance Group
• Sending 6 GB per day to Splunk
• Started using Splunk in Nov 2014
• Dashboards to serve multiple teams
• IT Ops, App Management, QA
• APM and App Dev use cases
• APM use case
• Sending data from Dynatrace into Splunk to track application performance metrics
• App Dev use case
• Alerts set up in QA/staging to catch issues before production
Current Data Sources and Users
• Intelligence
• QA
• PAS Team
• Exec Reports
• Data Power Logs
• BCI (Dynatrace)
• Custom App Logs
• App Servers
• Custom Application
• SOA Architecture
• NCSA
With Splunk
• Alerts in Splunk to investigate outages
• Increased confidence in reported availability and uptime
• Insights from the logs now available to the broader team
• Reduce Error 500s by 75%
“Since aggregating all of the logs into Splunk, we have much higher view and have been able to trend on that.”
Splunk, APM & the Error 500s
Feeding Dynatrace Data into Splunk
• Now have live views of performance data
• Can now immediately react to issues
• MTTR reduced from more than a day to minutes
• Allows us to manage the Error 500 KPI
Gathering APM Data from Dynatrace
• Installed on JVMs
• Performance timing data
• Once-a-day export of data, weekly reports
Instrumenting Logs for Splunk
Log.debug("orderstatus=error errorcode=500 user=%d", userId);
• Working with development team to optimize custom application logs for use in Splunk
• Currently adhering to single-line-per-event best practice
• Identifying additional logging practices to implement
• Logs are now high profile!
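As a rough illustration of the single-line-per-event practice, the sketch below formats an event as space-separated key=value pairs, in the style of the `Log.debug` example on this slide. The helper `formatEvent` and the class name `LogEventSketch` are hypothetical and do not belong to any real logging framework. Quoting values that contain spaces and flattening line breaks are assumptions about what keeps each event on one line and easy to parse.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative only: a hypothetical helper, not part of any logging library.
class LogEventSketch {

    // Builds one log line of key=value pairs in the map's iteration order.
    static String formatEvent(Map<String, ?> fields) {
        StringJoiner line = new StringJoiner(" ");
        for (Map.Entry<String, ?> field : fields.entrySet()) {
            String value = String.valueOf(field.getValue())
                    .replace("\r", " ")   // keep the event on a single line
                    .replace("\n", " ")
                    .replace("\"", "'");  // avoid breaking the quoting below
            if (value.isEmpty() || value.indexOf(' ') >= 0) {
                value = "\"" + value + "\""; // quote values containing spaces
            }
            line.add(field.getKey() + "=" + value);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("orderstatus", "error");
        event.put("errorcode", 500);
        event.put("user", 42);
        event.put("message", "timeout\nwhile saving");
        System.out.println(formatEvent(event));
    }
}
```

A `LinkedHashMap` is used in the example so the fields appear in the order they were added, which keeps related events visually comparable.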
Looking Forward with Splunk
• More Web service insight – Splunking IIS Server Logs
• More App insights – Splunking the JVM
• Using Splunk to support transition to DevOps
• Continuous Integration – Splunking Jenkins
• Package up dashboards into a PAS app for Splunk
• Implement more logging best practices
Tips and Takeaways
• Focus at first: Gain traction internally by initially focusing on a single problem or use case (e.g. 500 errors)
• Growth: Expand your line of vision by putting more data sources into Splunk
• Scale: Prepare for rapid internal adoption once teams see what Splunk can do
Wrapping Up
“Once people get hooked on data,
they just want more and more.”
Thank You

CSAA Customer Presentation


Editor's Notes

  • #3 In the past 10 years, we have grown from a regional insurance provider in three states to a coast-to-coast organization reaching nearly 17 million AAA members in 23 states and Washington DC.
  • #8 And we actually know that when we get a heartbeat alert, we can track it: "Hey, we got a heartbeat alert here. We see in these tools that it was out. And we see outage notifications within Splunk so that we can collaborate when we're getting those." We've been able to validate that outage notifications are in fact legitimate. So our service outages are now not just reported outages, but an automated announcement that our system is out. With that, our availability numbers haven't really changed much, but we have a lot more confidence in them now.