2. Joe Alex, Senior Big Data Engineer, Verizon
10/06/2015
1
Managing Security @1M Events/Sec
3. Introduction
• Senior Big Data Engineer, Tech. Lead @ Verizon
Managed Security Services
• Using Elasticsearch since ver 0.19
• Aspiring Data Scientist - Who is not ?
• Loves to work with data at scale
2
4. What we do - Manage Security for our Customers
• Collect Security Logs
• Correlate
• Store
• Index
• Analyze
• Monitor
• Escalate
3
6. Before Elasticsearch
• Traditional RDBMS won’t scale for
the billions of logs
filtered logs > events > incidents >
tickets
• All raw Logs were on disks
• Requests from customers took
days, weeks
• No way to search through billions
of Logs
• Advanced analytics not possible
5
http://www.liftoffit.com
7. After Elasticsearch
• Customers
have access to all their logs near real-time
can search and download their logs through the Portal
visualize/analyze using Kibana
• Operations
No more grep through disks
• Opens up the data for all kind of Analytics and Monitoring
Anomaly detection
Real-time alerting
Advanced monitoring
6
9. What we use and some numbers
• Multiple Elasticsearch Clusters
Search, Data Visualization, Analytics, Forensics
• Largest cluster has 128 Nodes
Current load about 20 billion docs per day
Has around 800 billion docs
• Index heavy use case (vs. search heavy)
• Hadoop for long term storage and analytics
• Spark for real-time analytics and monitoring
• Kafka for Queue
• Flume for collectors
8
10. How we progressed
• Earlier
Co-located with 28 Hadoop Data nodes
12 Core, 128GB RAM, 12 X 3TB Disks
Elasticsearch 0.19
• Later
Ran 2 Elasticsearch Nodes co-located with Hadoop data nodes
Effectively 56 Elasticsearch Nodes
• Now
128 dedicated bare metal boxes for Elasticsearch
8 core, 64GB RAM, 6 X 1TB Disks
Elasticsearch 1.5.2 (soon to ver 1.7)
9
11. Know your environment and data
• ENV
CPU
Memory
I/O
Network
• Elasticsearch typically runs in to Memory issues before CPU
Get the CPU – RAM – Disk ratios correct for your env.
Too much disk storage – ES may not utilize
• For data nodes prefer physical boxes
• For disks – SSD, RAID0, JBOD
10
12. Know your environment and data
• Data
Data ingestion rates
Type of data
o Our docs were mostly 1.5k – 2k, rarely 5k
o 10% of the customers produced 80% of data
o Variety of data
Volume
11
14. Things you should change
• change default location of data and logs
• change cluster.name
• avoid multicast use unicast
• discover timeouts adjust per your network
• use mapping/templates
plan your field types number, date, ipv4
• adjust gateway, discovery, threadpool, recovery settings
• adjust throttling settings
• evaluate breakers
• to analyze or not to
13
15. Things you should change
14
• JVM Heap set to 50% of available memory
Leave 50% for OS, page caching
Elasticsearch/Java tends to have issues after 31GB heap
• Disable _all, _timestamp, _source if you don't need it
• No swap - mlockall: true, vm.swappiness = 0 or 1
• Tune kernel parameters
file, network, user, process
vm.max_map_count = 262144
/sys/kernel/mm/transparent_hugepage/defrag = never
10G network tweaks
16. Dedicated Master, Client, Data Nodes
• Master
Only cluster management (don’t send search or indexing requests)
3 masters minimum
Avoid split-brain
• Client
Coordinators, Aggregation (send all search requests here, will co-ordinate)
Load balance behind Apache, Nginx, F5 …
• Data nodes
Indexing, Searches (send all indexing requests direct to data nodes)
• Use Tribe node to search across multiple clusters
15
17. Effects of shards, replication, indexes on Cluster
• Replication factor
More replicas – searches faster, but more memory pressure
We had factor 2 initially, later changed to 1
• Shards
More shards - better indexing rates, but more memory pressure
We had 2 per index initially, later as per customer 2 – 35 shards
• Index/Shard sizes
• Number of indexes (one big one, monthly, weekly, daily, hourly …)
• Index naming – performance, access control, data retention, shard size
• Know your data and plan shards and replicas
16
18. Field data cache
17
• When you do - sorting, facets/aggregation with high cardinality fields
All unique values are loaded to memory and held on to
never goes away
• Risks running out of memory
indices.breaker.fielddata.limit
indices.fielddata.cache.size
• Use doc_values - writes to a columnar store side of the inverted index
lives on disk instead of in heap memory (storage, indexing small effect)
for not_analyzed fields
default in Elasticsearch 2.0
19. Indexing
• Use Bulk Indexing
We use mapreduce, about 60 - 100 reducers do the indexing
flush size, find your sweet spot (ours is 5000)
index.refresh_interval: -1
Transport client - tcp vs http client, tcp slightly faster
Increase thread pool for bulk and adjust merge speed
• More shards better indexing, but watch cluster
• Watch out for Bulk Rejections and Hotspots
• Index direct to data nodes
• Now es-hadoop available
18
20. Key items for extremely large clusters
19
• Manage shard sizes and counts (including replicas)
• Hotspots - adjust shards per node
• Some Nodes/disks getting full
adjust disk.watermark low/high settings
• Disk failures (especially when you have multiple disks, striping)
remove disk from config and restart Node
• Set replication to 0 and adjust throttling for initial Bulk inserts
• Disable allocation for faster restarts
• Adjust throttling settings for recovery and indexing
• Elasticsearch shard is a Lucene index, max docs 2.1 billion
21. Watch out for
20
• Use Aliases from Day 1
• _type
use generic - minimize dynamic updating of mappings
• Template dir., all files will be picked up
• Scripting and Updates a bit slow, use carefully
• Node failures
• Disk failures
• Bulk Rejections
• Network timeouts
• ttl performance issues
22. Monitor and Stats
21
• Cluster and Node health/stats
• Heap
• Stats: clear view on what is going on in your cluster
intake volumes, when received at edge, when indexed, index rate
• Lots of APIs available for cluster/node health, stats
• Watch for hotspots – nodes, disks
• Watch for safety trips (from ES 1.4 onwards)
• Nagios, Zabbix, custom
• Housekeeping - Use curator or custom
• Use Marvel, Watcher
23. Get ready for production
• Difficult to recreate production volumes in Dev/QA
• Plan a buffering or queuing mechanism
• Be ready to Re-index
We had data in HDFS for a year and in ES for 6 months
• Monitor and Alert
With hundreds of machines/disks, something is bound to fail
• Stats
Find bottle necks, Project storage/processing needs
• Sharing a single config for same Node type helps
• Use automation as much as possible – Puppet, Ansible
22
24. Security & Access control
• Plan index per customer
• Use Aliases
• Control access via APIs
• Use a reverse proxy Apache, Nginx
Authentication/Authorization
Client nodes behind proxy
• Now Shield available
23
25. Tips on Searches
24
• Use Filters, they are cached
• Use match query instead of query_string
• term is not analyzed, match is analyzed
• For large search results – Use Scan search type and Scroll API