Taking Splunk to the Next Level - Architecture Breakout Session

Copyright © 2015 Splunk Inc.
Taking Splunk
to the Next Level
(Architecture)
David Millis

2
Splunk at the Next Level
Time to move beyond initial Splunk environment
• More use cases – how to tackle?
• More data – how do we scale?
• Splunk is mission critical == HA
• Global deployments
• Splunk user experience Screenshot here

3
Agenda
Use cases  Business Cases
Simple Scaling
Indexer Clustering
Search Head Clustering
Distributed Management Console
Centralized Configuration Management
Splunk Cloud & Hybrid Deployments
Architecture workshop
Q&A

4
Growing your Splunk Deployment
Many customers start with a single use case…
• Ex: Monitor the web servers
• Help ensure up-time & response times
• Track usage, errors
• Provides business value

5
Justify! Why should the CIO care?
Your services exist in a larger context than just one app, or one tier.
What is the value of the service as a whole?
What are CIO commitments for the service?
• The company’s web store is one of the most critical parts of the business.
• Performance of the overall environment must be maintained at all times.
• Failures in any portion of the web store must be quickly identified, send
notification to the appropriate parties.
• Dependencies on external processes must be monitored as well.

6
The larger context
• Failure in one system cascades
• Map dependencies, estimate costs
• Use Splunk to track all dependencies.
• What happens when it is down?
Dependencies often include:
• Networking dependencies
• Shared storage
• Databases, middleware, custom apps
• Virtualization layer
Screenshot here

7
Scaling
Multiple factors
Indexer: IOPs, daily rate
Storage: Usage & retention
Search Head usage

8
Scaling - Indexers
Sizing for indexing performance
Indexers are usually storage-bound
Indexers: 150 to 250 GB per day each. (With suitable storage)
Ref HW: 12 cores (2 GHz+), 12 GB RAM, 800+ IOPs
Optimal HW (normal disk): 16 CPU cores, 48 GB RAM
Optimal HW (SSD): 24 CPU cores, 132 GB RAM

9
SSD Advantage
http://blogs.splunk.com/2012/05/10/quantifying-the-benefits-of-
splunk-with-ssds/
• Low cost random seeks
• Writes are not that much faster – no great improvement with Indexing
• Significant improvements with Sparse/needle-haystack searches
• Dense searches become CPU bound
• Searches run faster allowing for more completed searches/min

10
Scaling - Storage
Simple storage to complex
Raw data rate  net compression of ~ 50% on disk.
To calculate disk usage: rate * compression * retention (days)
– 200 GB / day * 50% * 100 days = 10TB
Consider cold storage on slower arrays
– Design hot/warm/cold retention policy to minimize the number of
searches which will hit cold buckets
– Target at least 7 days retention on fast
Clustering
– Changes storage story

11
Scaling - Storage
Sizing Calculator: http://splunk-sizing.appspot.com/

12
Scaling - Storage
RAID + SSD deep dive
• For spinning disks, Splunk recommends RAID 1+0 with 1000 IOPs
• SSDs provide extremely high IOPs (45,000 +)
• A note about RAID 5
• RAID 5 SSD arrays give great Splunk performance in most scenarios
• RAID 5 spinning disk arrays perform poorly for indexing
Additional details: Splunk Docs, Capacity Planning Manual

13
Indexer Clustering
High-Availability, Out of the Box
Splunk indexer clustering
Active-Active= better performance
Specific terms:
– Master Node
– Peer Node
– Search Factor
– Replication Factor
Additional details: Splunk Docs, Distributed Deployment Manual

14
Cross-site Clustering
Search Affinity by location
“Search locally”, “Store Globally”
DR scenarios
Search Affinity can result in
slower searches in some
scenarios

15
Scaling the Search Heads
Splunk Search is critical, too!
Splunk Search high availability needs
Scale to handle # of concurrent searches
Search Activity App– Get It!
https://splunkbase.splunk.com/app/2632

16
Search Head Pooling vs Clustering
SHC
• SHP
• Available since v4.2
• Sharing configurations through NFS
• Single point of failure
• Performance issues
• No NFS
• Replication using local storage
• Commodity hardware
NFS

18
Search Head Clustering
Use “Captain” (not “Master”) to avoid confusion with Index-Clustering
Minimum 3 nodes required. Odd is preferred (for the number of nodes)
Cluster takes certain key decisions based on *majority* (consensus)
In multi-site setup have more nodes in main datacenter

19
Distributed Management Console
Manage Splunk 6.2 environments
Replaces Deployment Monitor App
Incorporates SOS app prior to 6.2

20
Deployment Server
Central management of Splunk Forwarders
Deployment Server manages Apps, Configs
Select one or more classes for each host
Class defines apps & configs
Works by “phone-home” from forwarder
Notes:
DS does not push forwarder binaries
Use Cluster Master to manage clustered indexers, not DS
Use Deployer to manage clustered search heads, not DS

21
Cloud & Hybrid
Scale without waiting for hardware

22
Let’s Architect
2
• 1Tb/day peak ingest
• Up to 50 concurrent users
• All data is being generated from a single data center
• Want a fault tolerant design for high availability
• 90 days data retention
Scenario:

24
Forwarding Tier
2
Design Factors
• Syslog Collectors (HA)
• DBConnect Inputs
– McAfee EPO data
• TA Inputs
– CheckPoint
• Assorted Inputs
– Microsoft AD logs
– MicroSoft Exchange Server
– Microsoft Sharepoint logs
– Log4j, Linux, IIS

25
Syslog Collectors
2
• Best Practice is to use dedicated syslog servers
• Syslog-NG/rSyslog recommended
• Syslog can write events to dedicated log files allowing for
easy sourcetype classification on inputs

26
Syslog Collectors
2
• Using a Load Balancer/VIP
with Linux Heartbeat to
provide failover for the syslog
listener
• Syslog-NG Profession Edition
(PE) provides client-side
failover
High Availability

27
Standalone Forwarder for “Interesting” TA’s
2
• Interesting = TAs which use exotic
“pull” input methods, such as
– TA-McAfee requires DBConnect
– TA-Checkpoint uses the LEA Client
• Allows events to be load-balanced
across indexing tier
• Not an HA design, but could use a
VM to standby or failover
• Consider combining with the Syslog
Server

28
Deployment Server
2
● Deployment Server to manage Linux and
Windows forwarders
● Not a HA design, but could be hosted on a VM to
standby or failover

30
Forwarding Tier BOM
3
Role Type Config #
Syslog Server Medium Virtual
4 vCPU, 12GB RAM
200GB virtual disk
2
HWF Small Virtual
2 vCPU, 8GB RAM
20GB virtual disk
1
Deployment
Server
Medium Virtual
4 vCPU, 12GB RAM
200GB virtual disk
1
Load Balancer - - -

31
Forwarding Tier Design Best Practices
3
• Use a Syslog Server for Syslog data
• Be careful with Intermediate forwarders
– They can introduce bottlenecks
– Reduce the distribution of events across Indexers
• AutoLB will spread over all available indexers, but don’t assume
evenly!
– Enable forceTimebasedAutoLB
• May need to increase UF thruput setting for high velocity sources
– maxKBps, queue settings
– Multiple UF instances on a single high-volume server

33
Indexing Tier
3
Design Factors
• 1 Tb/day (1000Gb/day) peak ingest
• High Availability – Indexer Replication
(RF=3/SF=3)
• 10% Disk Space Contingency
• 90 days minimum data retention
– Cluster Sizing Calculator:
– http://splunk-sizing.appspot.com

34
Storage Calculations
3
• RAID Configuration Considerations
– Amount of raw disk
– Fault tolerance
– Available IOPS
• Filesystem Overhead
– inodes consume space
• Wiggle room
– Additional replicated buckets when a node fails
– Unbalanced replicated buckets
– Splunk internal logs, Summary Indexes, Report Acceleration, Accelerated
Data Models

36
Storage Types
3
• Direct Attached vs SAN vs NAS
• SSD/Flash vs Spinning Disk
– SSDs offer much higher IOPS with no latency
– Significant performance increases with Sparse Searches
– More expensive, but price dropping quickly

37
Cluster Master Server
3
• Indexer Apps are deployed via Cluster Master (CM)
• Very little disk/filesystem usage
• Not a HA design, but could be hosted on a VM to standby or failover

39
Indexing Tier BOM – Solution A
3
Role Type Config #
Indexer Medium Physical
16 core, 64GB RAM
12*1TB 10K SAS (RAID10)
20
Cluster Master Medium Virtual
4 vCPU, 12GB RAM
200GB virtual disk
1

40
Indexing Tier BOM – Solution B
4
Role Type Config #
Indexer Large Physical
24 core, 96GB RAM
6*800GB SSD (RAID6)
6*2TB 7.2K SATA (RAID10)
13
Cluster Master Medium Virtual
4 vCPU, 12GB RAM
200GB virtual disk
1

41
Indexing Tier Design Best Practices
4
• Depending on Search load, 100-250Gb of indexing volume per day
– More concurrent searches = less raw indexing volume
• Use fast disk (SSDs) for hot/warm, and slower/cheaper for cold
– Design hot/warm/cold retention policies to minimize the number of
searches which will hit cold
• If clustered:
– leave more headroom (disk space, processing, memory)
– Make sure that you have sufficient local and wide-area network bandwidth

42
How Clustering Affects Sizing
• Increased storage:
– 15% of raw usage for every replica copy
– 35% MORE to make that searchable
• Increased processing
– Incoming data to indexer is streamed to indexing peers to satisfy required
number of copies
• More hosts
– Need “replication factor” + 2 (search head, cluster master)
4

43
Benefits of Clustering
• Data redundancy
• Data availability
• Indexer resiliency
• Simpler management of indexers
• Simpler setup of distributed search
• Multi-site clustering allows site-specific search to reduce WAN traffic
4

44
Downsides of Clustering
• More complexity
• Increased Storage
• Extra machine (cluster master) required
• Increased local network bandwidth
• Hard to manage with DS (read: don’t)
4

46
Search Tier
4
Design Factors
• High Availability
• Search Head Clustering
• # users
• # concurrent searches
• Forward all data to indexers
• Apps being used

47
SHC & Deployer
4
• Search Head Cluster Apps need to be installed by the Deployer
• A minimum of 3 Search Heads are required for a SHC
• No Exchange or VMware app with SHC
– Anything leveraging tscollect based searches will need modification
– Improvements in v6.3

49
Search Tier BOM
4
Role Type Config #
Search Head Medium Physical
16 core, 64GB RAM
2*800GB 10K SAS (RAID 10)
3
Deployer Small Virtual
2 vCPU, 8GB RAM
20GB virtual disk
1
License Server
Small Virtual 2 vCPU, 8GB RAM
20GB virtual disk
1
Load Balancer - - -

50
Search Tier Design Best Practices
5
• ES will still require a Separate Search Head or dedicated SHC
• Use LDAP/AD/SSO for user Authentication
• Load Balancer configured for sticky sessions

54
Hybrid Approach
5
• Add the existing Splunk
instance as a search peer
until the data retention
period has expired
• Disable scheduled searches
on the old instance
• Migrate any Summary
Index data to new Indexers

56
Top 5 things to consider
5
• Indexer Storage requirements – Size and IOPS
• Minimum buy-in for a SHC is 3
• Use VMs for CM/LS/DS/Deployer if possible
• Consider a dedicated SH for a Distributed Management Console
• When in doubt – add another Indexer

57
How Apps Affect Sizing
• Enterprise Security – Requires a dedicated search head
• Don’t share hosts with other services
– Not co-located with Exchange, Active Directory, Hypervisors
• Don’t let anti-virus run on the Splunk partition
• Some data collection apps require a full instance (heavy forwarder)
– VMWare
– Checkpoint LEA
5

58
Sizing Considerations
• http://docs.splunk.com/Documentation/Splunk/latest/Installation/Cap
acityplanningforalargerSplunkdeployment
– Amount of incoming data
– Amount of indexed (stored) data
– Number of concurrent users
– Number of saved searches
– Types of searches
– Specific Splunk apps
• http://docs.splunk.com/Documentation/Splunk/latest/Installation/Perf
ormancechecklist

59
Required Reading
• Distributed Deployment Manual
– http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Distributedoverv
iew
– Highlights
 Reference hardware specs
 How searches affect performance
– Dense / Rare / Sparse
 App considerations
 Summary table
• Sailing Alone Around The World – Joshua Slocum
5

The 6th Annual Splunk Worldwide Users’ Conference
September 21-24, 2015  The MGM Grand Hotel, Las Vegas
• 50+ Customer Speakers
• 50+ Splunk Speakers
• 35+ Apps in Splunk Apps Showcase
• 65 Technology Partners
• 4,000+ IT & Business Professionals
• 2 Keynote Sessions
• 3 days of technical content (150+ Sessions)
• 3 days of Splunk University
– Get Splunk Certified
– Get CPE credits for CISSP, CAP, SSCP, etc.
– Save thousands on Splunk education!
60
Register at: conf.splunk.com

61
We Want to Hear your Feedback!
After the Breakout Sessions conclude
Text Splunk to 878787
And be entered for a chance to win a $100 AMEX gift card!

Taking Splunk to the Next Level - Architecture Breakout Session

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (19)

Similar to Taking Splunk to the Next Level - Architecture Breakout Session

Similar to Taking Splunk to the Next Level - Architecture Breakout Session (20)

More from Splunk

More from Splunk (20)

Recently uploaded

Recently uploaded (20)

Taking Splunk to the Next Level - Architecture Breakout Session

Editor's Notes