The 6th Annual Splunk Worldwide Users’ Conference
September 21-24, 2015 The MGM Grand Hotel, Las Vegas
• 50+ Customer Speakers
• 50+ Splunk Speakers
• 35+ Apps in Splunk Apps Showcase
• 65 Technology Partners
• 4,000+ IT & Business Professionals
• 2 Keynote Sessions
• 3 days of technical content (150+ Sessions)
• 3 days of Splunk University
– Get Splunk Certified
– Get CPE credits for CISSP, CAP, SSCP, etc.
– Save thousands on Splunk education!
Register at: conf.splunk.com
Introduce presenters. This presentation covers IT Operations/Analytics; if you are in the wrong presentation we can help you get to the right one.
The intent of this hands-on session is to walk through one of those dreaded 2 a.m. calls, but instead of filling a bridge with people, use Splunk to identify the issue and send it to the appropriate team, create a ticket to track our work, build an alert to ensure it does not happen again, and reuse the data to send our Customer Service team the details of which customers are affected, so we can proactively notify them and maybe even earn their loyalty.
But first let’s cover a couple of slides to set the stage – then we can get to the fun stuff.
Most companies start using Splunk in one of these 5 areas, and typically, as more teams use Splunk, it traverses all 5 of them. Both IT and business professionals can analyze machine data to get real-time visibility and operational intelligence. With our platform for machine data, organizations can meaningfully improve their performance in a wide range of areas, e.g. meet service levels, reduce costs, mitigate security risks, maintain compliance and gain insights.
Today we are going to focus on some of the major use cases and values related to the IT Operations Space.
In IT Operations this maturity model is a great template/mainstay for how Splunk is utilized. Most teams have downloaded Splunk on a laptop, and from there it gets scaled to a server, then to multiple servers, and so on. The idea behind the ITOps maturity model is very much the same.
Search and investigation. Using Splunk, organizations identify and resolve issues up to 70% faster and reduce costly escalations by up to 90%. Splunk is one place to find and fix problems, and investigate incidents across all your IT systems and infrastructure.
Proactive monitoring. Monitor IT systems in real time to identify issues, problems and attacks before they impact your customers, services and revenue. Splunk keeps watch of specific patterns, trends and thresholds in your machine data so you don't have to. Trigger notifications in real-time via email or RSS, execute a script to take remedial actions, send an SNMP trap to your system management console or generate a service desk ticket.
Operational visibility. See the whole picture, track performance and make better decisions. Visualize usage trends to better plan for capacity; spot SLA infractions, track how you are being measured by the business. Do all of this using your existing machine data without spending millions of dollars instrumenting your IT infrastructure.
Real-time business insight. Make better-informed business decisions by understanding trends, patterns and gaining Operational Intelligence from your machine data. See the success of new online services by channel or demographic, reconcile 3rd-party service provider fees against actual use, find your heaviest users and heaviest abusers, and more. Because machine data captures every behavior, the possibilities are game changing. You'll find the lead times to get to this intelligence dramatically less than other solutions - measured in minutes/hours instead of months.
Who is at Search and Investigate? Raise your hands. Proactive Monitoring and Alerting? Raise your hands. Operational Visibility? Raise your hands. Real-time Business Insight? Raise your hands. Who thinks it makes sense for all of us to have our business at Real-time Business Insight? Why?
So how do we get there?
Splunk is a platform that consists of multiple products and deployment models to fit your needs.
Splunk’s capability to digest all machine data and allow users to quickly analyze it for insight is its most compelling feature. We call this the universal machine data platform.
For this Hands on Demo we are going to focus on Splunk Enterprise/Splunk Cloud:
Splunk Enterprise – used for on-premises deployments
Splunk Cloud – A managed service with all the capabilities of Splunk Enterprise…in the Cloud with a 100% SLA
What - The Common Information Model (CIM) allows you to normalize your data to match a common standard, using the same field names and event tags for equivalent events from different sources or vendors.
Why - The CIM acts as a search-time schema ("schema-on-the-fly") to allow you to define relationships in the event data while leaving the raw machine data intact.
Once you have normalized the data from multiple different source types, you can develop reports, correlation searches, and dashboards to present a unified view of a data domain.
You can display your normalized data in the dashboards provided by other Splunk-developed applications such as the Splunk App for Enterprise Security and the Splunk App for PCI Compliance.
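As a minimal sketch of what CIM normalization buys you (the tag and field names below follow the CIM Authentication data model as one common example; your data sources may differ), a single search can report across every vendor’s normalized events:

```
tag=authentication action=failure
| stats count by src, dest, user
```

Because each add-on maps its vendor-specific fields to `src`, `dest`, `user`, and `action`, this one search behaves the same whether the events came from Windows, *nix, or a network device.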
What this means for ITOps – heterogeneous environments –
- Who has only one type of server, storage, switch, or firewall?
- Database
- Select
Where does Splunk fit – Splunk’s schema-on-the-fly harnesses this capability to rename/alias common field names and event tags for equivalent events from different sources or vendors, providing a singular view of storage, CPU (Windows & *nix), and more.
What is a Splunk App – A Splunk App is a prebuilt collection of dashboards, panels and UI elements powered by saved searches, packaged for a specific technology or use case to make Splunk immediately useful and relevant to different roles.
What is a Splunk Add-on – Captures/indexes data; identifies relevant events and provides field extractions, tags, and CIM compliance.
Why do they work – They come prepackaged with inputs, props, and transforms to standardize getting the data in, indexing the data, search-time extractions, saved searches, and macros.
Where do you put them – They tell you where to put them; e.g., the *nix add-on goes on the forwarder, indexer, search head, or deployment server.
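As an illustration of what “prepackaged props” means (the sourcetype stanza and vendor field names here are hypothetical), an add-on’s props.conf might alias vendor-specific fields to their CIM names at search time:

```
[acme:firewall]
# Alias the vendor's field name to the CIM-standard field
FIELDALIAS-acme_src = source_address AS src
# Normalize the vendor's action values to CIM-friendly ones
EVAL-action = if(act="deny", "blocked", "allowed")
```

The raw events are untouched; the aliasing happens at search time, which is exactly the schema-on-the-fly behavior described above.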
CIM + Add-ons = ITOps fast time to value, not only for the events, alerts, and correlation, but also for giving development, business, and other teams the ability to see IT in a single location.
Definitions – These are pretty standard terms – feel free to raise your hand if you have questions. These are the terms we will use to discuss the framework put into place.
Bonus question – Why do we have KPIs/SLAs? Can we use them to measure the impact of introducing Splunk to the ITOps team?
Alright, now to the fun stuff… Remember, we will be working through the 2 a.m. call.
How many of you have experienced this in your career? Raise your hands.
Anyone care to share an example? Network problems? Capacity problems? Database Problems?
Everyone, let’s pull out our laptops and log into Splunk.
For our hands-on exercise – we have received the call from our team, and they report that one of our services, called “Webstore”, is having issues with customers not being able to complete orders, and the blame game may have started between the different internal teams.
Alright, let’s get everyone logged in. Once you are logged in, just go ahead and look at us. If you have any issues please raise your hand and we can come help you out.
Okay, let’s type in index=oidemo
We have all seen similar datasets right?
We can see we have 6-7 different sourcetypes…
Some web logs, some JSON, some system logs, etc. – all different varieties, variability, and velocity.
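A quick way to confirm that variety for yourself (a simple sketch against the demo index):

```
index=oidemo
| stats count by sourcetype
| sort - count
```

This gives one row per sourcetype with its event count, so everyone can see at a glance how many different data formats are flowing into the one index.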
So what’s next? Let’s all choose an event and open it up. It’s pretty great that the different fields are being extracted at search time, but how much more useful would it be if we could understand, on the fly, which applications this entity/host is associated with?
Let’s click on “Event Actions”. <Briefly describe Splunk workflow actions> – Look at that, we can see “Get Application Information”. Let’s click on it.
I know we are supposed to be troubleshooting our issue. Trust me, this foundational detail will help us understand how we can track an event from the host to the application and maybe even beyond. So, quickly – everyone can see that we have the host/entity as the name associated with the event, that the entity is associated with application <Blah>, and that there are other hosts/entities also associated.
Let’s click anywhere on the timechart graph and see if we can have Splunk show us the event counts broken out by the individual hosts/entities we see above, instead of all together.
Nice! Now we can see the individual host/entity details – the raw events – and even better, the service this host/entity is part of. Again, let’s do some drilldown and click the service in blue; maybe it will tell us what other hosts/entities are associated with this service.
Let’s pause for a minute; I know we did a lot of clicking and want to ensure everyone is where we are. Does anyone have questions? (Hope someone asks how Splunk is mapping Entity-Application-Service.) If not, ask: does anyone know how Splunk understands the Entity-Application-Service relationship?
Let’s take a moment to discuss the CMDB. Does anyone want to share with the group their definition of a CMDB? Does anyone happen to have this correlation in Splunk at their company? Anyone want to share why this may be important to your organization? Wouldn’t it be awesome to be able to visualize ALL services?
Let’s click on the drop-down and select “All”.
Awesome we have “All” the Services
So we discussed SLAs and KPIs in our definitions, right? Would this mapping be valuable for alerting, reporting, and visualizing them? If we understand the underlying entities/hosts, we can use that detail in our searches to define what is important – things like: if one machine has high CPU but the other two are fine, do we need an alert? Unknown, but now we are able to think like that, rather than the more conventional “we need to know if a machine has CPU over 85% utilization”.
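As a sketch of that service-aware thinking (the sourcetype and the cpu_load_percent field are illustrative – the real names depend on which add-on feeds your CPU data), you could alert only when more than one host in the service is hot, instead of on any single 85% spike:

```
index=oidemo sourcetype=cpu
| stats avg(cpu_load_percent) as avg_cpu by host
| where avg_cpu > 85
| stats count as hosts_over_threshold
| where hosts_over_threshold > 1
```

The search only returns a row (and so only fires as an alert) when two or more hosts in the group are over threshold at once.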
So now to the troubleshooting – let’s click on the IT Operations dashboard.
This dashboard is customized for the items important to this NOC:
Entities/Hosts -> Applications ->Services
We can evaluate the individual components that make up a Service from Host components Network/Storage/Compute
Why is this important?
MTTR
Capacity Planning
Everyone on the Same Page
Blame Games
We have a breakdown of response codes. Everyone familiar with 200, 400, and 500 codes?
We can also see that we are experiencing both successful and erroring connections at all geographical points, so we can rule out a regional issue. Some requests are succeeding, but the major issue is a large number of “Service Unavailable” responses. Maybe this is a downstream issue – there are middleware and database tiers that also make up this service. Let’s get down in the weeds.
Click on “Investigate Webstore Details”
Hmm, this is interesting – anyone want to tell me which one of these applications is not like the others? Our transactions across the Apache web tier and our middleware are in the green and yellow, but wow, the database looks to be having issues. Oh nice, someone is running a number of expensive queries. Let’s dive into MySQL…
Click on “MySQL Application”
Now we can see the relevant details for MySQL – the current searches, search duration, and CPU and memory details by user. So what can we do?
Okay, so we have an idea of what is happening. We are investing our time and need to make sure the issue is visible, right? Does it make sense to create a ticket? We can use “event actions” to do exactly that – act on the event. Let’s click on the Hax0r’s expensive query – Splunk’s token searches to the rescue. Let’s open this first event and click “Event Actions” – nice, we have the ability to “Create Ticket”.
Click “Create Ticket”
This is “ACME” ticket creation, because Splunk has this capability with any ticketing system; we have apps like the ServiceNow app to integrate with some of the more popular ticketing systems, but this can easily be built for even a custom ticketing system. Even better, Splunk has already started filling out the ticket details. Let’s finish the process.
Complete the details (Username, Criticality, Details)
Click Submit and mention that a refresh of the page now shows my ticket – validation that the ticket was successfully submitted.
Everyone able to create a ticket?
That is pretty awesome, but that is just for our team’s tracking – let’s go back to our previous tab.
Close the Ticket Creation Tab
Click on previous Tab “Database Metrics”
Let’s do something a bit more beneficial so we are not waking up if this happens again. I think we should make an alert for this event – but how? Ah, let’s try “Event Actions”, just maybe?
Click “Event Actions”
Nice there it is! “Create Alert”
Ah, another pop-out window and we are back at Search – let’s create that alert.
We can see this macro is building a statistics table per user – the median query time for each user and the overall median time. So let’s take that detail and see if we can find the user who is running queries over the median time.
Add “|where user_time_taken > median_time_taken” to the search string and click search
There is Hax0r – now to save the alert
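Put together, the alert search looks something like this (a sketch only – in the demo the macro builds the statistics table for you, and the time_taken field name is an assumption about the demo dataset):

```
index=oidemo sourcetype=stream:mysql
| stats median(time_taken) as user_time_taken by user
| eventstats median(user_time_taken) as median_time_taken
| where user_time_taken > median_time_taken
```

The stats gives one row per user, eventstats adds the overall median to every row, and the where clause keeps only the outliers – which is how Hax0r surfaces.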
Click “Save As” and select “Alert”
Give the Alert a Title: <yourname>User_DBQuery
Description: <Your Choice>
Alert Type: Scheduled
Time Range: Thursday at <now + 5m>
Trigger conditions: Defaults
Click Next
List in Triggered Alerts: Check
Send Email: Check
To: <your email>
Priority: Default
Subject: Default
Message: Default
Include: Your Choice
Run A Script?
Discuss how a simple script could be called here to connect to the MySQL box and stop this user’s query due to its long runtime and intensity. Would that be beneficial? A self-healing activity?
When Triggered: Default
Click Save
Return to search
In the search bar, replace “stream:mysql” with “access_combined”
The results of this search will provide a list of all CUSTOMERS who are having issues with their interactions.
This list of CUSTOMERS can be sent to the Customer Service team for follow-up – a proactive email to explain that the organization was aware of the issue, apologize, etc. Maybe mention that with this effort, ITOps is now providing near real-time CUSTOMER benefit and impacting customer loyalty for the organization. Is this an example of real-time business insight?
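A sketch of that customer-facing search (using standard access_combined fields; swap clientip for whatever customer identifier your web logs actually carry):

```
index=oidemo sourcetype=access_combined status>=500
| stats count as failed_requests, values(uri_path) as affected_pages by clientip
```

The same data that drove the troubleshooting now produces a per-customer list of failed requests and the pages they failed on – ready to hand to Customer Service.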
----- Meeting Notes (4/22/15 10:47) -----
Splunk Apptitude is live and open.
You've got 90 days.
To win more than $150,000 in cash and prizes.
Last day to submit is July 20th, 2015.
We'll announce the winners at Black Hat in August.
Good luck!
And finally, I would like to encourage all of you to attend our user conference in September.
The energy level and passion that our customers bring to this event is simply electrifying.
Combined with inspirational keynotes and 150+ breakout sessions across all areas of operational intelligence, it is simply the best forum to bring our Splunk community together, to learn about new and advanced Splunk offerings, and most of all to learn from one another.