Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splunk .conf2012


A presentation titled "Splunk All the Things: Our First 3 Months Monitoring Web Service APIs" that Dan Cundiff and Eric Helgeson from Target Corporation gave at Splunk .conf2012.

  • A big story to draw you in! Anonymized lat/long data of guests searching for stores in the last 15 minutes. If a store wasn’t near those 61 people in Idaho, did they go somewhere else to buy Tide, diapers, or socks? Conceptually, maybe we should build a store there (we don’t actually plan our stores with a sole data point like that, but it gives you an idea).
  • Here’s the context for all the material that follows. “Enterprise Services” program is all about…
  • Logs scattered everywhere = complex ecosystem. Looming horizon = data explosion. Story: going live, millions of hits start coming in, try to figure out what is actually happening.
  • 4 hours. No joke. We were drawn to innovate; just try something new and see what happens.
  • “You don’t know what you don’t know, but Splunk knows what you don’t know.” – that is, Splunk can help by telling you and helping discover what you don’t know.
  • Drill down: filter to the essential places across logs to troubleshoot or discover business intel. Community and Google-able: Splunkbase, documentation rules!, lots of Google results = good.
  • HTTP errors in a 24-hour period; what is normal? 500s are bad. Many 500s early on, but corrected, and much lower now.
  • A list of consumers of the Locations service over a 24-hour period. Story: identified a bad API key before the developer knew what was wrong.
  • We’re taking a look at our infrastructure design because of this.
  • Able to report on non-functional requirements. Going forward we can do a better job of not over-estimating infrastructure needs, which saves money, avoids idle inventory sitting on the shelf, and opens the door to putting the right money in the right places.
  • You saw the original map at the beginning of our presentation; as we expose more APIs, what can we learn from them?
  • How are we adhering to this advice? We have accomplished many of these metrics already. Most of these are achievable with Splunk.
  • The more you have in Splunk, the more complete the monitoring picture can be.
  • Great for perf/load testing; see all the errors in one place. You can even put the Jenkins logs in Splunk and show the results across all APIs being developed.
  • Allow apps to have multiple ways to get logs into Splunk. No UF on consumer devices. Build transactions across multiple layers of the infra. Use UFs on end points everywhere = FASTEST. Else, consolidate and mount Splunk = FAST. Else, use CLS RESTful API = SLOW.
  • A flattering meme, but at that point, after the demos and the successful research, Splunk sells itself, and honestly at that point everyone is happy to move on, buy what’s needed, and get down to Splunking.
  • Nothing is wrong. Your data is wrong. Getting people to trust what Splunk is telling us. Story about one of the nodes being down, and initially people didn’t believe it was right.
  • If developers bring Splunk in, take time to educate ops people on how it all works so they understand how the infrastructure is different and how it should be built. We suspect it’s normally the other way around.
  • Get those indexers behaving like Swift or Cassandra: multi-tenant, N=3 replication of the data so cheap servers can fail, scale, etc.

    1. 1. Splunk All the Things: Our First 3 Months Monitoring Web Service APIs
        Dan Cundiff (@pmotch) and Eric Helgeson (@nulleric), Target Corporation
        Copyright © 2012 Splunk Inc.
    2. 2. 2
    3. 3. Agenda
        Context
        Problem
        Solution
        Examples
        In progress and future stuff
        Lessons and challenges
    4. 4. Context: Enterprise Services @ Target
        Data and transactional APIs for all the domains in our business
          – Products (inventory, price, description, etc)
          – Locations
          – Coupons
          – etc
        APIs exposed inside and outside
        Mostly RESTful APIs, some pub sub/messaging
        Used by mobile devices, applications, partners on the outside, etc.
        Constantly evolving, rapidly improving, all the time
    5. 5. Problem
        First API go-live:
          – Millions of log events per day (grep/cut/sed/awk not cutting it)
          – Logs scattered everywhere
          – Limited access to logs
          – Needed end-to-end visibility of web services
          – Needed ability to discover information in logs
          – Can we be proactive? Faster when reactive?
        Looming horizon:
          – BILLIONS of log events coming
          – Questions changing every day from business, support, execs, developers
    6. 6. Solution: Gave Splunk a try
        Installed Splunk on a lab server
        Hooked up Splunk to the logs
        Quickly created 15+ searches and reports
        Generated a dashboard for visibility and trending
        Total time to do all this in Splunk: ~4 hours
    7. 7. Why Splunk
        Understanding what’s “normal”
          – Identify tolerances
          – Identify actionable events vs. anomalies
        You don’t know what you don’t know
          – …but Splunk can tell you what you don’t know
    8. 8. Why Splunk, part 2
        Indicators when things are trending badly
          – Proactive monitoring and recovery
          – Standard deviations, percentage changes over time, outliers
        Full stack visibility
          – API gateway
          – Network (load balancers, firewalls)
          – Web/app
          – OS
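
The "standard deviations, percentage changes over time, outliers" indicators on slide 8 can be illustrated outside Splunk with a few lines of Python. This is only a sketch of the idea, not the searches used in the talk; the function names, the sample counts, and the threshold of 2 standard deviations are assumptions made for illustration.

```python
from statistics import mean, stdev


def find_outliers(counts, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold`
    sample standard deviations from the mean of `counts`."""
    if len(counts) < 2:
        return []
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # every bucket is identical, so nothing stands out
    return [(i, c) for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


def percent_change(previous, current):
    """Percentage change from `previous` to `current`."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous * 100.0


# Hypothetical error counts per time bucket; the last bucket spikes.
errors_per_bucket = [100, 102, 98, 101, 99, 100, 103, 97, 100, 400]
print(find_outliers(errors_per_bucket))
print(percent_change(20, 25))
```

A fixed threshold is the simplest choice; what counts as "normal" for a given API would have to come from its own history.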
    9. 9. Why Splunk, part 3
        Quick and flexible dashboards
        Drill down
        Community (Splunkbase, blogs, etc)
        Google-able™
        App store!
    10. 10. Locations Service Examples
    11. 11. What is “normal”?
        Volume
    12. 12. What is “normal”?, part 2
        API response time SLAs
    13. 13. What is “normal”?, part 3
        Errors happen, but what is acceptable?
    14. 14. 404s
        ~1700 errors once a day every week
        404s for stores that don’t exist
        Bot?
          – Who are they?
          – Malicious? Competitor? Individual?
          – Reach out to understand why
    15. 15. Understanding consumers
        Who and how is it being used?
        What’s their experience?
    16. 16. Understanding consumers, part 2
        Load testing in production?
    17. 17. Understanding infrastructure
        Expected design vs actual implementation
        Not balancing workload as expected
    18. 18. Understanding providers
        How are providers responding?
        Is overhead added to the API response?
    19. 19. Requirements feedback loop
        Requirement: 200 tps
        Actual: ~20 tps
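
To make the requirement-versus-actual comparison on slide 19 concrete, here is a minimal sketch of how transactions per second (tps) could be derived from request timestamps in a log. The function name and the sample timestamps are assumptions for illustration; the slides do not show how the ~20 tps figure was computed.

```python
from collections import Counter


def throughput(timestamps):
    """Return (average_tps, peak_tps) for a list of epoch-second timestamps.

    The average is taken over the whole observed span, including
    seconds in which no request arrived.
    """
    per_second = Counter(int(ts) for ts in timestamps)
    if not per_second:
        return 0.0, 0
    span = max(per_second) - min(per_second) + 1
    average = len(timestamps) / span
    peak = max(per_second.values())
    return average, peak


# Hypothetical request times, in seconds since some starting point.
average, peak = throughput([0.1, 0.5, 1.2, 1.3, 1.9, 3.0])
print(average, peak)
```

Comparing the measured peak against the stated requirement (200 tps on the slide) shows how much of the provisioned capacity is actually used.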
    20. 20. Business intelligence from APIs
        Where are people searching?
        Where should we build our next store?
        How far are people traveling?
        What time of day?
        Mobile vs website?
        iOS vs Android?
        International?
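
One way to approach "How far are people traveling?" from logged lat/long pairs is the haversine great-circle distance between the search location and the store. The sketch below is an illustration only: the function name is hypothetical, the Earth is treated as a sphere of radius 6371 km, and nothing here is taken from the presenters' searches.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points
    given as latitude/longitude in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```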
    21. 21. Metrics for APIs
        (source: http://blog.programmableweb.com/2012/08/02/the-api-measurement-secret-know-what-metrics-matter/)
        Traffic Metrics
          – Total calls
          – Top methods
          – Call chains
          – Quota faults
        Service Metrics
          – Performance
          – Availability
          – Error rates
          – Code defects
        Support Metrics
          – Support tickets
          – Response time
          – Community metrics
        Developer Metrics
          – Total developer count
          – Number of active developers
          – Top developers
          – Trending apps
          – Retention
        Marketing Metrics
          – Developer registrations
          – Developer portal funnel
          – Traffic sources
          – Event metrics
        Business Metrics
          – Direct revenue
          – Indirect revenue
          – Market share
          – Costs
    22. 22. In progress and future stuff
    23. 23. Splunk all the things
        Consumer apps
        Provider systems
        OS, firewalls, proxies
        External API gateway logs
        Anything in between (middleware, integrations, etc)
        Correlate with logs from apps degrees away (e.g. .com web logs)
        Development (perf test results, git, Jenkins/CI, wiki, etc)
    24. 24. Dashboards
        Global dashboard summarizing all APIs
        BI dashboards
        Executive dashboards
    25. 25. Dashboards, part 2
        Environment dashboards for each API
          – CI
          – Test
          – Stage
          – Prod
    26. 26. Dashboards, part 3
        Alert trending dashboards for each API
    27. 27. Splunking Continuous Integration
        Drill down into CI results linked straight from Jenkins
          – Filtered by date OR transaction GUID
    28. 28. Splunking Continuous Integration, part 2
        We practice code as documentation
        Every commit, Jenkins runs, extracts documentation from code, puts it in the respective wiki pages (pretty cool! – automated / no humans)
        Splunk monitors wiki changes using the MediaWiki API
        Monitor CI + human wiki changes
        https://github.com/pmotch/wikislurp
    29. 29. Common Logging Service
        CLS is our strategy for getting logs from all places into Splunk
        How
          – Use UFs on end points everywhere
          – Else, consolidate and mount Splunk
          – Else, use CLS RESTful API
        Enables end-to-end visibility
          – Insert GUIDs across all the hops in the transaction
        Use out of the box log formats (e.g. Log4j)
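
The point of inserting a GUID at every hop is that events from different layers can later be stitched into one transaction. The sketch below shows that grouping step in plain Python under stated assumptions: the field names `guid`, `ts`, and `layer` are hypothetical, and in practice this correlation would be done with a Splunk search rather than custom code.

```python
from collections import defaultdict


def build_transactions(events):
    """Group events by their `guid` field and order each group by `ts`."""
    by_guid = defaultdict(list)
    for event in events:
        by_guid[event["guid"]].append(event)
    return {
        guid: sorted(hops, key=lambda e: e["ts"])
        for guid, hops in by_guid.items()
    }


def elapsed(hops):
    """Time between the first and last hop of one ordered transaction."""
    return hops[-1]["ts"] - hops[0]["ts"]


# Hypothetical events as they might arrive from three layers, out of order.
events = [
    {"guid": "a1", "ts": 105, "layer": "provider"},
    {"guid": "a1", "ts": 100, "layer": "gateway"},
    {"guid": "b2", "ts": 101, "layer": "gateway"},
    {"guid": "a1", "ts": 102, "layer": "app"},
]
transactions = build_transactions(events)
print([hop["layer"] for hop in transactions["a1"]], elapsed(transactions["a1"]))
```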
    30. 30. Lessons and challenges
    31. 31. Lessons
        RTFM
          – Keep logs flat
          – Keep timestamp (ISO8601) at the beginning
          – k=v
        Iterate quick, push to prod; minimal tweaks to Splunk
        Flatten out of box audit events (XML)
          – Toggle at runtime
        Don’t re-invent the wheel, use what your system provides, Splunk can handle it!
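
A log line that follows the three "RTFM" lessons (flat, ISO 8601 timestamp first, key=value pairs) might be produced as in the sketch below. The function name, the field names, and the quoting rule for values containing spaces are assumptions for illustration, not a format prescribed in the talk.

```python
from datetime import datetime, timezone


def format_event(fields, when=None):
    """Render one flat log line: ISO 8601 timestamp, then k=v pairs.

    Values containing a space, '=' or '"' are wrapped in double quotes
    so that each pair stays a single token.
    """
    when = when or datetime.now(timezone.utc)
    pairs = []
    for key, value in fields.items():
        text = str(value)
        if " " in text or "=" in text or '"' in text:
            text = '"' + text.replace('"', '\\"') + '"'
        pairs.append(f"{key}={text}")
    return f"{when.isoformat()} " + " ".join(pairs)


# Hypothetical event from a Locations service request.
stamp = datetime(2012, 9, 12, 15, 30, 0, tzinfo=timezone.utc)
print(format_event({"service": "locations", "status": 404, "msg": "store not found"}, stamp))
```

Keeping every event on one line with the timestamp first leaves little for the indexer to guess about event boundaries or field names.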
    32. 32. Lessons, part 2
        Don’t pre-optimize up front
          – Governance
          – Standards
          – Alerting
          – Access controls
        Optimize as needed
    33. 33. Lessons, part 3
        Create a community
    34. 34. Lessons, part 4
        Create best practices, standards, etc in a wiki
    35. 35. Challenges: Organizational
        “Stop. We already have tools that do this. Use those.”
          – tgtMAKE saves the day
          – tgtMAKE = R&D
          – R&D = $, servers, flak shelter, people network
        Make it real strategy
          – Demo to as many key players as possible
          – Drum up interest
          – Show actual value
    36. 36. Challenges: Organizational, part 2
        http://knowyourmeme.com/photos/361379-shut-up-and-take-my-money
    37. 37. Challenges: Organizational, part 3
        The data can’t be trusted?
    38. 38. Challenges: OS
        RHEL 6
        SELinux
        Ipfw
        Install notes: http://nulleric.tumblr.com/post/13855621770/splunk-on-redhat-6-install-notes
    39. 39. Challenges: Infrastructure
        VM requirement
        Adhering to MDHA requirements
        Universal Forwarder skepticism
    40. 40. Challenges: Logs on the outside
        Universal Forwarders on servers that we don’t manage
        Firewalls
        Multi-layered DMZs
    41. 41. Challenges: Splunk…
    42. 42. Challenges: Splunk (err, improvements)
        Index improvements
          – Cheap servers, can fail, can expand
          – Replication, N=3
          – Replicas on N-1 subsequent nodes
          – Data is always available, smooth out across servers if they go down or expand
          – Multi-tenant
          – Think OpenStack Swift “Ring” concept or Cassandra
          – There’s that CAP Theorem thing; they say it’s a big deal.
        GUI for deployment client configurations (lazy and for n00bs, we know)
        Ability to extend charts with other libraries (like D3 or something)
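
The "ring" idea behind the index wish list (N=3 copies, with replicas on the N-1 nodes that follow the primary) can be sketched as a small hash ring. This illustrates the general Swift/Cassandra-style concept only; it does not describe how Splunk indexers work, and the class name, node names, and the choice of SHA-256 as the hash are assumptions.

```python
import bisect
import hashlib


class Ring:
    """Minimal hash ring: each key maps to a primary node plus the
    next `replicas - 1` nodes clockwise on the ring."""

    def __init__(self, nodes, replicas=3):
        if replicas > len(nodes):
            raise ValueError("need at least as many nodes as replicas")
        self.replicas = replicas
        # Place every node on the ring at the position given by its hash.
        self._ring = sorted((self._hash(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, key):
        """Return the `replicas` nodes responsible for `key`, primary first."""
        start = bisect.bisect(self._positions, self._hash(key)) % len(self._ring)
        return [
            self._ring[(start + offset) % len(self._ring)][1]
            for offset in range(self.replicas)
        ]


# Hypothetical indexer names.
ring = Ring(["idx1", "idx2", "idx3", "idx4", "idx5"], replicas=3)
print(ring.nodes_for("bucket-42"))
```

Because placement depends only on the hash positions, any client can compute where a piece of data lives, and losing one cheap server still leaves two copies on its ring neighbours.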
    43. 43. Recap
        Be bold. Tooling matters. Sell it.
        Splunk all the things!
        Iterate, adapt, change quickly.
    44. 44. We’re hiring (come talk to us)
    45. 45. Questions?