Slide deck delivered at the June Splunk User Group in Edinburgh: Supporting Splunk at Scale, Splunking at Home & Introduction to Enterprise Security.
Sign up to the group here: https://usergroups.splunk.com/group/splunk-user-group-edinburgh/
Application Upgrades & Patching: Analysis is performed on each Splunk software release to determine its relevance and importance to the customer's estate. When required, upgrades are applied to all Splunk software applications via the change control process.
Application Health Monitoring: Monitoring the performance of the system on a 24/7 basis. This covers upticks in events, day-to-day capacity (disk space, memory, CPU, index size), log sources and user searches.
Application Management: Use case management, rule management and app/TA upgrades, including changes to ES searches.
First Line Break/Fix: Responsible for fixing any deficiencies within the Splunk deployment and, where required, liaising with Splunk Inc. This is done via an app developed by ECS which can alert on-call engineers out of office hours.
License Usage: To ensure that the customer does not exceed their Splunk software license, and that projected license use is clearly understood, the engineering function monitors current utilisation.
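As a sketch, current license consumption can be tracked with a search over Splunk's internal license usage log (the index and source are the Splunk defaults; the daily span is illustrative, and the search should run where the license master's logs are indexed):

```spl
index=_internal source=*license_usage.log type=Usage
| timechart span=1d sum(b) AS bytes
| eval GB = round(bytes / 1024 / 1024 / 1024, 2)
| fields _time GB
```

Alerting on a threshold near the daily license quota gives early warning before violations accumulate.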
Incident Awareness: Where appropriate, platform-related incidents are raised in the customer's service portal and issues are communicated, with updates, to key stakeholders.
Data Onboarding: Support data onboarding activities to verify that data is successfully arriving in the Splunk environment via universal forwarders and syslog. This also includes input configuration to allow Splunk to index the data.
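A minimal sketch of the input configuration side, assuming a universal forwarder tailing a log file and a forwarder listening for syslog over the network (the paths, index names and sourcetypes are illustrative, not taken from the customer environment):

```ini
# inputs.conf on a universal forwarder: monitor a log file
[monitor:///var/log/messages]
sourcetype = syslog
index = os_linux

# inputs.conf on a forwarder receiving syslog over the network
[udp://514]
sourcetype = syslog
index = network
```

In practice a dedicated syslog server (e.g. rsyslog writing to files monitored by a forwarder) is usually preferred over a direct UDP input, since it survives Splunk restarts without data loss.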
Data Restoration Subject to enough storage and the provision for data to be frozen, ECS Engineers shall restore from archive any data that has been frozen at the request of the customer.
Splunkd, Scheduler, Metrics, License
Ulimits: Traffic was being dropped on Splunk ports and engineers were being alerted to outages on indexers, despite all of them appearing to be available by the time the engineer responded. There was no evidence to suggest the outages were caused by resource utilisation, so the support team performed a destructive resync. This didn't work. Further investigation found "Network-Layer error: No route to host." It was later discovered that the ulimits settings for "Max User Processes" and "Open Files" were too low. These are restrictions put in place to stop users consuming too many resources. The number of open files is an obvious limit in the case of indexers writing data. Max user processes needs to accommodate all splunkd threads, and threads grow with every concurrent HTTP connection, parallel pipeline, KV store operation and concurrent search.
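As an illustration, the fix is raising the OS limits for the account running Splunk. Values along these lines follow common Splunk sizing guidance, but should be sized to your own workload; they go in /etc/security/limits.conf (assuming splunkd runs as a user named splunk):

```ini
# /etc/security/limits.conf: raise limits for the splunk user
splunk soft nofile 64000
splunk hard nofile 64000
splunk soft nproc  16000
splunk hard nproc  16000
```

After re-logging in as the splunk user, the effective limits can be verified with `ulimit -n` (open files) and `ulimit -u` (max user processes); splunkd also records the limits it detected at startup in splunkd.log.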
Hardware Restrictions: In the beginning, the biggest bottleneck within our environment was system memory. We were running indexers with around 30 GB of memory at a time when we were indexing up to 800 GB a day and running Enterprise Security with CIM data models. The number of searches hitting the indexers would quite often trigger a Linux mechanism called the OOM (Out Of Memory) Killer, which killed splunkd processes as they consumed dangerous amounts of memory. To overcome this we lightened the load on the system by disabling some of the larger data models and reducing the number of concurrent searches the SHC could perform by 20%. This was a tactical measure until the customer could provision more hardware.
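The concurrency reduction can be sketched in limits.conf on the search head cluster members. Splunk derives the default search concurrency ceiling as max_searches_per_cpu × number of CPUs + base_max_searches, so lowering either knob lowers the ceiling; the exact values below are illustrative and depend on the hardware:

```ini
# limits.conf on the search head cluster members (illustrative values)
[search]
max_searches_per_cpu = 1
base_max_searches = 4
```

Lowering concurrency trades search queueing on the search heads for memory headroom on the indexers, which is usually the safer failure mode.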
Duplication of Logs: We noticed multiple copies of every log. Running a search that counted raw events and distinct raw events, then dividing one by the other, showed the duplication. A case was created with Splunk support and, after some investigation, they discovered this was a bug affecting users who used indexer discovery on multi-site clusters, present in versions 6.4.1 – 6.4.3. Whenever a peer in the cluster goes down the data starts duplicating, and when the peer comes back up the data continues to index multiple times; the issue doesn't completely go away. This was fixed in Splunk 6.4.4.
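The duplication check described above can be sketched as a search comparing total events against distinct raw events. Counting distinct `_raw` values is expensive, so the index name and time range here are illustrative and should be scoped tightly:

```spl
index=main earliest=-1h
| stats count AS total dc(_raw) AS distinct
| eval duplication_factor = round(total / distinct, 2)
```

A duplication_factor close to 1 is healthy; a value of 2 or more suggests every event is being indexed multiple times.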
Warm Index Volume Full: Splunk hot buckets were not rolling to warm, and the support team were made aware via alerting. The volume was at around 91% capacity, whilst the Splunk configuration stated buckets should roll at 88% (4.8 TB). Further investigation showed that data from a previous investigation had been moved onto this volume and not deleted: around 495 GB of data. Once that data was removed, the used capacity shrank and Splunk began to function as expected. From what we understand from this investigation, Splunk only measures its own usage of a volume. In our case this meant maxVolumeDataSizeMB would never be reached, because with the foreign data present Splunk could effectively only use 4.5 TB of the volume.
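For illustration, the relevant volume definition looks something like the stanza below (the path and size are assumptions, not the customer's actual values). The key point is that maxVolumeDataSizeMB only counts Splunk-managed buckets within the volume path; files placed on the filesystem by anything else are invisible to Splunk's accounting:

```ini
# indexes.conf: warm volume definition (illustrative path and size)
[volume:warm]
path = /opt/splunk/var/warm
maxVolumeDataSizeMB = 4800000
```

Pairing this with an OS-level disk usage alert catches the gap between what Splunk believes it is using and what the filesystem actually holds.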
Splunkd Refuses to Start on Indexers: Splunkd refused to start following OS patching, while other indexers patched during the same window restarted without issue. The splunkd and crash logs were observed from the console; it was the first time I'd ever seen a log level of FATAL. The message read "Detected directory manually copied into its database, cause id conflicts". A number of duplicate db (primary bucket) and rb (replicated bucket) directories existed on the indexer. In order for Splunk to operate, one of each duplicated pair had to be removed; since it couldn't make that decision itself, it simply refused to start. Despite having potentially uncovered the root cause, as a matter of diligence we raised a support ticket with Splunk to ensure no data was removed unnecessarily. They confirmed our findings and passed on instructions on how to proceed. We believe this was caused by the indexer cluster not being placed in maintenance mode before the indexer was shut down. Maintenance mode essentially halts bucket replication and fix-up activity (except for primary bucket fix-up).
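A sketch of the safer patching sequence using the standard cluster CLI commands (shown without full paths to the splunk binary; run each command on the node indicated):

```shell
# On the cluster master: halt bucket replication and fix-up before touching peers
splunk enable maintenance-mode

# On the indexer being patched: take the peer down gracefully
splunk offline

# ...apply OS patches and reboot; splunkd rejoins the cluster on restart...

# On the cluster master: resume normal replication and fix-up
splunk disable maintenance-mode
```

Taking the peer offline gracefully while the cluster is in maintenance mode avoids the bucket fix-up churn that can leave conflicting db/rb directories behind.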
Splunk have made an attempt at addressing these problems with their SIEM tool, Enterprise Security.
ES has been designed to take you through the entire process, from monitoring for threats to actually handling the incident that has been discovered.
To aid the monitoring and event triage processes, ES has been developed with a number of features: notable events to highlight and help prioritise what needs investigating; correlation across multiple log sources to track down the root cause of a problem; and a means of enriching your data and assigning context to which users and assets have been affected. It also allows you to bring in threat data from external sources to provide wider coverage and monitoring, and has a risk scoring framework which helps prioritise investigations.
A dedicated Splunk search head handles its excessive load: all the background searching, data model building, macro running and so on. Configuration mainly revolves around data onboarding; data must be aligned to the CIM, otherwise ES won't use it. TAs help with some of the initial data onboarding and CIM mapping.
What is a notable event? A correlated event, generated from a correlation search running in the background. Its urgency is calculated from the severity of the event and the priority of the asset.
Understanding where assets are, who owns them, their criticality and who should be accessing them helps prioritise security events and investigations. ES has the ability to integrate your asset and identity information through the use of lookup files. These then populate the data models and are used in a variety of out-of-the-box searches, help populate dashboards, and assign urgency to notable events.
The Risk Scoring Framework enables a risk score to be applied to any event, asset, behavior or user based on its relative importance or value to the business. This helps security teams prioritise alerts based on predefined thresholds, while also exposing the contributing factors of the risk to all relevant teams, who can then easily track their security status to understand and actively manage overall business risk. Risk scores are applied to notables to determine the impact of an incident quickly.
Use risk scores to generate actionable alerts to respond to matters that require immediate attention. The Risk Object filter works by performing a reverse lookup against the asset and identity tables to find all fields that have been associated with the specified risk object. All associated objects found by the reverse lookup are then displayed on the dashboard. For example, if you select a risk object type of system and enter a risk object of 10.10.1.100, the reverse lookup against the assets table could return a MAC address. The Risk Analysis dashboard will then display any risk score applied to both the 10.10.1.100 address and the MAC address. If no match to another object is found in the asset table, only the IP address matches from the Risk Analysis data model are displayed.
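As a sketch, aggregate risk per object can be pulled from the Risk data model with tstats; the data model and field names follow the ES defaults, and the sort simply surfaces the highest-risk objects first:

```spl
| tstats sum(All_Risk.risk_score) AS risk_score
    from datamodel=Risk.All_Risk
    by All_Risk.risk_object, All_Risk.risk_object_type
| sort - risk_score
```

This is essentially what the Risk Analysis dashboard renders; running it ad hoc is useful when validating a new risk-modifying correlation search.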
ES allows for the collection, aggregation and de-duplication of threat feeds. It supports STIX/TAXII and OpenIOC feeds, ships with out-of-the-box Activity and Artifacts dashboards, and applies the data to correlation searches, alerting when your users' behavior matches your threat data.
View Anomaly Detection: View data in the form of dashboards and reports to quickly identify anomalous behaviors and trends related to assets and identities in the environment.
Enhance incident response and investigations by leveraging and correlating data from a broad set of sources, including security and non-security data collected from across the organization, and supplemented with internal and external threat intelligence and other contextual information.
Accelerate Table Dataset - Users can now accelerate tables from the Datasets listings page.
Time-Range Picker - Previously, users could either preview 50 random rows, or specify a time range and view the results in the summarized fields view. Now, users can view the events in the dataset by selecting a time range.
Edit Table - Users can easily navigate to the Table Editor by selecting the "Edit Table" option.
Schedule Report - Users can schedule their datasets to run as a report and view the results on the Reports listings page.
Export Dataset - Export datasets in various formats.
Trellis - Show multiple similar visualizations at once to compare across different segments of a dataset with one single query.
Search Optimizer: built-in optimizations that analyze and process searches for maximum efficiency, filtering results as early as possible to reduce the amount of data that needs to be processed.
Predicate Splitting: takes a predicate with multiple parts and, when possible, moves the parts to an earlier place in the search, making it run faster and more efficiently.
Projection Elimination: analyzes your search and determines whether any of the generated fields specified in the search will not actually be used to produce the search results. If generated fields are identified that can be eliminated, an optimized version of the search is run; your search syntax remains unchanged.
Event Tagging Control: a directive added to control how much event-typing and tagging occurs, improving search performance.
SPL "union" Command: merges the results from two or more datasets into one dataset. One of the datasets can be a result set that is piped into the union command and merged with a second dataset. It appends or merges events from the specified datasets, depending on whether the dataset is streaming or non-streaming and where the command is run; it runs on indexers in parallel where possible, and automatically interleaves results on _time when processing events.
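A minimal sketch of the union command merging two subsearch result sets (the index and sourcetype names are illustrative):

```spl
| union
    [ search index=web sourcetype=access_combined status>=500 ]
    [ search index=app sourcetype=app_error ]
| stats count by index, sourcetype
```

Because both subsearches here are streaming, union can interleave their events on _time and push work down to the indexers in parallel.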
Key Features for SEARCH HEADS:
Continuous replication - Ensures continuous replication of knowledge objects across the SHC members.
Intelligent captain selection - Avoids out-of-sync SHC members becoming captain.
Simplified SHC quota management - Provides independent controls for user/role and system-wide quota management.
Optimized bundle push and replication - Improved bundle push and replication performance.
Key Features for INDEXERS:
Improved scalability - Scale up to 5+ million cluster-wide unique buckets and 15+ million total buckets.
Indexer node offline without search disruption - Avoids search disruption by automatically ensuring a primary copy of all buckets is available prior to taking a node offline.
Faster indexer recovery - Performance improvements lower cluster master load and enable faster recovery in case of node failures.