Cloud Security Monitoring and Spark Analytics

Linux OS events are streamed through RabbitMQ to Spark to generate Postgres rollup tables.


  1. Cloud Security Monitoring and Spark Analytics
     Boston Spark Meetup
     Threat Stack
     Andre Mesarovic
     10 December 2015
  2. Threat Stack - Who We Are
     • Leadership team with deep security, SaaS, and big data experience
     • Launched on stage at 2014 AWS re:Invent
     • Founded by principal engineers from Mandiant in 2012
     • Based in Boston's Innovation District
     • 27 employees and hiring
     • On track for 100+ customers and 10,000 monitored servers by year-end 2015
     • Funded by Accomplice (Atlas) and .406 Ventures
  3. Threat Stack - Use Cases
     • Insider Threat Detection
     • External Threat Detection
     • Data Loss Detection
     • Regulatory Compliance Support - HIPAA, PCI
  4. Threat Stack - Key Workload Questions
     • What processes are running on all my servers?
     • Did a process suddenly start making outbound connections?
     • Who is logged into my servers and what are they running?
     • Has anyone logged in from non-standard locations?
     • Are any critical system and data files being changed?
     • What happened on a transient server 7 weeks ago?
     • Who is changing our Cloud infrastructure?
  5. Threat Stack - Features
     • Deep OS Auditing
     • Behavior-based Intrusion Detection
     • DVR Capabilities
     • Customizable Alerts
     • File Integrity Monitoring
     • DevOps Enabled Deployment
  6. Threat Stack - Tech Stack
     • RabbitMQ
     • Nginx
     • Cassandra
     • Elasticsearch
     • MongoDB
     • Redis - ElastiCache
     • Postgres - RDS
     • Languages: Node.js, C, Scala and a bit of Lua
     • Chef
     • Librato, Grafana, Sensu, Sentry, PagerDuty
     • Slack
  7. Spark Cluster
     • Spark 1.4.1
     • Spark standalone cluster manager - no Mesos or YARN (see the sketch below)
     • One long-running Spark job - running over 2 months
     • Separate driver node
       – Since the driver has a different workload, it can be scaled independently of the workers
     • We like our cluster to be a homogeneous set of worker nodes
       – One executor per worker
     • Monitored by Grafana
     • Custom Codahale metrics consumed by Grafana
       – Only implemented for the driver - for the workers it's a TODO
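A minimal sketch of how a long-running job on a Spark 1.4 standalone cluster might be configured. The master URL, core and memory values are illustrative assumptions, not Threat Stack's actual settings; in standalone mode an application gets one executor per worker by default, which matches the homogeneous-cluster setup above.

    // Illustrative Spark 1.4 standalone configuration (values are assumptions).
    import org.apache.spark.{SparkConf, SparkContext}

    object AnalyticsApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("spark-analytics")
          .setMaster("spark://spark-master:7077") // standalone master - no Mesos or YARN
          .set("spark.cores.max", "32")           // cap on total cores the app may claim
          .set("spark.executor.memory", "20g")    // standalone mode: one executor per worker
        val sc = new SparkContext(conf)
        // ... long-running analytics job ...
      }
    }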
  8. Spark Cluster Hardware
  9. Threat Stack Overall Architecture
  10. Spark Analytics Architecture
  11. Spark Web UI - Master
  12. Spark Web UI - Jobs
  13. Event Pipeline Statistics
      Mean event size is 700 bytes.

                     Second    10 Min Interval   Day       Month
      Mean events    75 K      45 M              6.48 B    194 B
      Spike events   125 K     75 M              10.8 B    324 B
      Mean bytes     52.5 MB   31.5 GB           4.5 TB    136 TB
      Spike bytes    87.5 MB   52.5 GB           7.6 TB    227 TB
  14. Problem that Spark Analytics Addresses
      • Overview
        – Spark replaced home-grown rollups and Elasticsearch facets
        – Original solutions did not scale well
      • Home-grown rollups of streaming data
        – Used eep.js - a subset of CEP that adds aggregate functions and windowed stream operations to Node.js
        – Postgres stored procedures to upsert rolled-up values
        – Problem: way too many Postgres transactions
      • Elasticsearch facets
        – Great for the initial moderate volume
        – Running into scaling issues as we grow
  15. Why not Spark Streaming?
      • We first tried to use Spark Streaming
      • Ran OK in the dev env but failed in the prod env, which carries roughly 20x the load
      • Too many endurance and scaling problems
      • Ran out of file descriptors on workers very quickly
        – Sure, we can write a cron job, but do we want to?
        – Zillions of 24-byte files that were never cleaned up
      • Too many out-of-memory errors on workers
        – Intermittent and random OOMs
        – Workers crashed in 3 days due to a tiny memory leak
      • No robust RabbitMQ receiver - everyone is focused on Kafka
      • Love the idea, but it just wasn't ready for prime time
  16. Current Spark Solution
      • Decouple event consumption and Spark processing
      • Two processes: Event Writer and Spark Analytics
      • Event Writer consumes events from the RabbitMQ firehose
        – Writes batches to the scratch store every 10-minute interval
      • Spark job wakes up every 10 minutes to roll up events by different criteria into Postgres (see the driver-loop sketch below)
        – For example, at 10:20 the Spark job processes the data from 10:10 to 10:20
      • Spark then deletes the interval data of 10:10 to 10:20
      • Spark uptime: 64 days since Oct. 7, 2015
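A sketch of the interval-driven driver loop just described, assuming a simple fixed-rate scheduler; the rollup and delete helpers are hypothetical stand-ins for the real Spark and S3 work.

    // Sketch of the 10-minute driver loop (helper bodies are illustrative stubs).
    import java.util.concurrent.{Executors, TimeUnit}

    object AnalyticsDriver {
      private val TenMinutesMs = 10 * 60 * 1000L

      // At 10:20 this returns the 10:10-10:20 interval, and so on.
      def previousInterval(nowMs: Long): (Long, Long) = {
        val end = nowMs - nowMs % TenMinutesMs  // truncate to the 10-minute boundary
        (end - TenMinutesMs, end)
      }

      def main(args: Array[String]): Unit = {
        val scheduler = Executors.newSingleThreadScheduledExecutor()
        scheduler.scheduleAtFixedRate(new Runnable {
          def run(): Unit = {
            val (start, end) = previousInterval(System.currentTimeMillis)
            rollupToPostgres(start, end)   // Spark job over the interval's S3 objects
            deleteScratchData(start, end)  // expire the processed interval
          }
        }, 0L, 10L, TimeUnit.MINUTES)
      }

      // Hypothetical helpers standing in for the real rollup and cleanup code.
      def rollupToPostgres(start: Long, end: Long): Unit = ()
      def deleteScratchData(start: Long, end: Long): Unit = ()
    }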
  17. Basic Workflow
      • Event Writer consumes RMQ messages and writes them to S3 (see the sketch below)
      • RMQ messages are in MessagePack format
      • Each message is one document per org/agent/type, with a header and an array of events
      • Event Writer flattens this into a batch of events
      • Output is a gzip JSON sequence file - one JSON object per line
      • Event Writer writes fixed-size output batches of events to S3
      • Current memory buffer for a batch is 100 MB
      • This compresses down to 3.5 MB - 28x compression
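A condensed sketch of that workflow: consume from RMQ, flatten each document into JSON lines, gzip, write to S3, then ack the batch. The MessagePack decode step is stubbed out, and the class name, bucket, and key scheme are illustrative assumptions, not the production code.

    // Sketch of the Event Writer batching loop (decode step is a stub).
    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
    import java.util.zip.GZIPOutputStream
    import com.amazonaws.services.s3.AmazonS3Client
    import com.amazonaws.services.s3.model.ObjectMetadata
    import com.rabbitmq.client.{AMQP, Channel, DefaultConsumer, Envelope}

    class EventWriter(channel: Channel, s3: AmazonS3Client, bucket: String) {
      private val maxBufferBytes = 100 * 1024 * 1024  // 100 MB in-memory buffer
      private val buffer = new StringBuilder
      private var lastDeliveryTag = 0L

      // Stub: the real writer decodes a MessagePack doc (header + event array)
      // and flattens it into one JSON object per line.
      private def toJsonLines(body: Array[Byte]): Seq[String] =
        Seq(new String(body, "UTF-8"))

      val consumer = new DefaultConsumer(channel) {
        override def handleDelivery(tag: String, env: Envelope,
                                    props: AMQP.BasicProperties, body: Array[Byte]): Unit = {
          toJsonLines(body).foreach(line => buffer.append(line).append('\n'))
          lastDeliveryTag = env.getDeliveryTag
          if (buffer.length >= maxBufferBytes) flush()  // fixed-size batches
        }
      }

      private def flush(): Unit = {
        val gzBytes = new ByteArrayOutputStream()
        val gz = new GZIPOutputStream(gzBytes)
        gz.write(buffer.toString.getBytes("UTF-8"))
        gz.close()  // ~28x compression: 100 MB buffer -> ~3.5 MB object
        val bytes = gzBytes.toByteArray
        val meta = new ObjectMetadata()
        meta.setContentLength(bytes.length.toLong)
        s3.putObject(bucket, s"events/${System.currentTimeMillis}.json.gz",
                     new ByteArrayInputStream(bytes), meta)
        channel.basicAck(lastDeliveryTag, true)  // multiple-ack the whole batch
        buffer.clear()
      }
    }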
  18. Advantages of Current Solution
      • Separation of concerns - each process is focused on doing one thing best
      • Event Writer is concerned with non-trivial RMQ flow control
      • Spark simply reads event sequences from scratch storage
      • Thus Spark has more resources to compute rollups
      • Each app can scale independently
      • Spark Streaming was trying to do too much - handle both RMQ ingestion and analytics processing
      • Current solution is more robust
  19. Capacity and Scaling
      • Good news - Spark scales linearly for us
      • We ran tests with different numbers of workers and the results were linear
      • Elasticity: we can independently scale the Event Writers and the Spark cluster
      • With Spark Streaming we could not dynamically add more RMQ receivers without restarting the app
  20. Event Writer Stats
      • One Event Writer per RabbitMQ exchange
      • We have 3 RMQ exchanges
      • 10-minute interval for buffering events
      • 100 MB in-memory event buffer compresses down to 3.5 MB
      • Compression factor of 28x
      • 600 S3 objects per interval (compressed)
      • 2.1 GB per interval (uncompressed would be 58.8 GB)
      • Need 2 intervals present - current and previous - 4.2 GB (118 GB uncompressed)
  21. Event Types
      • audit - accept, bind, connect, exit, etc.
      • login - login, logout
      • host
      • file
      • network
  22. Event Example
      {
        "organization_id" : "3d0c49e818bac99c72b7088665342daf30a3bcd7",
        "agent_id" : "835af48534bfd4bc60f8c5882dd565c5a84e4b94",
        "arguments" : "/usr/sbin/sshd -D -R",
        "_id" : "835af48534bfd4bc60f8c5882dd565c5a84e4b94",
        "_type" : "audit",
        "_insert_time" : 1429902593,
        "args" : [ "/usr/sbin/sshd", "-D", "-R" ],
        "user" : "root",
        "group" : "root",
        "path" : [ "/usr/sbin/sshd", null ],
        "exe" : "/usr/sbin/sshd",
        "timestamp" : 1429902590000,
        "type" : "start",
        "syscall" : "execve",
        "command" : "sshd",
        "uid" : 0,
        "euid" : 0,
        "gid" : 0,
        "egid" : 0,
        "exit" : 0,
        "session" : 4294967295,
        "pid" : 7829,
        "ppid" : 873,
        "success" : true,
        "parent_process" : {
          "pid" : 873,
          "exe" : "/usr/sbin/sshd",
          "command" : "sshd",
          "args" : [ "/usr/sbin/sshd", "-D" ],
          "loginuid" : 4294967295,
          "timestamp" : 1427337850230,
          "uid" : 0,
          "gid" : 0,
          "ppid" : 1
        },
        ...
      }
  23. Spark Event Count Rollups
      • total counts - org and agent (see the sketch below)
      • user counts - org, agent, user and exe
      • IP counts, geo-enriched via a MaxMind geo DB file on each worker
        – IP source counts - org, exe, ip, country, city, lat, lon
        – IP destination counts - ibid
      • host counts - org, comment
      • port source counts - org, exe and port
      • port destination counts
      • CloudTrail events of four kinds
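A minimal sketch of what one of these rollups (total counts by org and agent) might look like as Spark 1.4 RDD code. The S3 path, JSON field access, and the stand-in for the Postgres write are illustrative assumptions, not the actual implementation.

    // Sketch of a total-count rollup over one 10-minute interval of gzip JSON lines.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.json4s.DefaultFormats
    import org.json4s.jackson.JsonMethods.parse

    object TotalCountRollup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("total-count-rollup"))
        implicit val formats = DefaultFormats

        // One interval written by the Event Writer, e.g. 10:10-10:20 (path is illustrative).
        val events = sc.textFile("s3n://scratch-bucket/events/20151210-1010/*.gz")

        val counts = events
          .map { line =>
            val json = parse(line)
            ((json \ "organization_id").extract[String],
             (json \ "agent_id").extract[String])
          }
          .map(key => (key, 1L))
          .reduceByKey(_ + _)

        // Each partition would upsert its counts into the Postgres rollups table via JDBC.
        counts.foreachPartition { rows => rows.foreach(println) }  // stand-in for the JDBC write
        sc.stop()
      }
    }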
  24. Sample Rollups Table
      insert_time         | event_time          | org_id                   | agent_id                 | count
      --------------------+---------------------+--------------------------+--------------------------+-------
      2015-11-08 15:41:18 | 2015-11-08 15:00:00 | 5522d0276c15919d69000x01 | 563bd15419d2f85c2c9085c1 | 216652
      2015-11-08 20:01:24 | 2015-11-08 19:00:00 | 5522d0276c15919d69000x01 | 563bd15419d2f85c2c9085c1 | 207962
      2015-11-08 15:31:17 | 2015-11-08 15:00:00 | 5522d0276c15919d69000x01 | 5665c53b04d674f048e0892e | 160354
      2015-11-08 15:01:34 | 2015-11-08 14:00:00 | 5522d0276c15919d69000y01 | 563bd15419d2f85c2c9085c1 | 160098
      2015-11-07 21:51:31 | 2015-11-07 20:00:00 | 5522d0276c15919d69000x01 | 5665c53b04d674f048e0892e | 149813
      2015-11-08 03:08:53 | 2015-11-08 00:00:00 | 533af57f41e9885820006771 | 5632c6431612b6096d195d02 | 144999
      2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e988582000a7b1 | 55fc8beb7f8ce68d5052b6c9 | 143072
      2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e9885820006771 | 55f989dacc155d6d5e2627cf | 141468
      2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e9885820006771 | 55f98b41cc155d6d5e262811 | 137778
      2015-11-17 15:21:11 | 2015-11-17 15:00:00 | 5522d0276c15919d69000x01 | 566f217100229a8b2bdce000 | 128375
  25. Scratch Event Data
      • S3
        – Easy to get started with Spark's S3 support (gzip support)
        – Mean write time is 350 ms - the 99.9 percentile is 23.9 sec (see the table below)!
        – This clogs up our processing pipeline
        – S3 is "eventually consistent" - there are no SLAs guaranteeing when a written object is available
      • Alternatives
        – NoSQL store such as Redis - under active exploration now
        – AWS Elastic File System - when will it arrive (announced in an April blog post)?
        – HDFS
  26. S3 Write Percentiles
      Percentile   Millis
      50.00        349
      90.00        560
      99.00        1,413
      99.50        2,081
      99.90        23,898
      99.99        50,281
      max          139,596
  27. S3 vs Redis Write Latencies
      All write latencies are in milliseconds. The "10-min intervals" column is the sample size.

                       Mean   Max       10-min intervals
      S3               349    139,596   15,172
      Redis            43     168       7,313
      Speedup factor   8      831
  28. Data Expiration
      • The problem of big data is how to efficiently delete data
      • Every byte costs - AWS is not cheap
      • Big data at scale costs big bucks
      • In the real world, companies have to deal with data retention
      • Deleting objects (see the sketch below)
        – Spark
          • After processing S3 objects, Spark deletes them
          • Backup with AWS life-cycle expiration (1 day)
        – Redis
          • Use Redis TTLs
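The two deletion paths above in sketch form, assuming the AWS Java SDK and the Jedis client; key names and TTL values are illustrative.

    // Sketch of both expiration paths (S3 delete after processing, Redis TTL).
    import com.amazonaws.services.s3.AmazonS3Client
    import redis.clients.jedis.Jedis

    object Expiration {
      // Spark deletes an interval's S3 objects after it has been rolled up;
      // a 1-day bucket life-cycle rule acts as the backup if this fails.
      def deleteInterval(s3: AmazonS3Client, bucket: String, keys: Seq[String]): Unit =
        keys.foreach(key => s3.deleteObject(bucket, key))

      // Redis alternative: let the store expire the data itself.
      def writeWithTtl(jedis: Jedis, key: String, value: String): Unit =
        jedis.setex(key, 2 * 10 * 60, value)  // keep two 10-minute intervals, then expire
    }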
  29. RabbitMQ Flow Control - Message Ack-ing
      Flow control is fun!
      • Fast publisher - slow consumer
      Message Ack-ing
      • MultipleRmqAckManager - acknowledges all messages up to and including the supplied delivery tag
      • SingleRmqAckManager - acknowledges just the supplied delivery tag
      • When we have written an S3 object, we ack all the RMQ messages in that batch (see the sketch below)
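In the RabbitMQ Java client the two strategies come down to the multiple flag on basicAck; the manager class names above are Threat Stack's, the calls below are the standard API.

    // The two ack strategies expressed with the standard RabbitMQ Java client API.
    import com.rabbitmq.client.Channel

    object Acks {
      // MultipleRmqAckManager-style: ack everything up to and including deliveryTag,
      // e.g. once a whole batch has been safely written to S3.
      def ackBatch(channel: Channel, deliveryTag: Long): Unit =
        channel.basicAck(deliveryTag, true)

      // SingleRmqAckManager-style: ack just this one delivery tag.
      def ackOne(channel: Channel, deliveryTag: Long): Unit =
        channel.basicAck(deliveryTag, false)
    }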
  30. RabbitMQ Prefetch Count
      • Limits the number of unacknowledged messages on a channel (see the sketch below)
      • Important for the Event Writer to handle so as not to OOM during traffic surges
      • Sadly, RMQ doesn't implement AMQP prefetch by byte size
      • Only supports a prefetch count for the number of messages
      • This works if the messages are of roughly the same size
      • Fortunately this is the case for us
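Setting the prefetch limit is one call on the channel; the count below is an illustrative value, not the production setting.

    // Cap unacknowledged messages per channel with basicQos (count-based only).
    import com.rabbitmq.client.ConnectionFactory

    object PrefetchExample {
      def main(args: Array[String]): Unit = {
        val connection = new ConnectionFactory().newConnection()
        val channel = connection.createChannel()
        // RMQ prefetch is message-count based, not byte based, so this only bounds
        // memory when messages are of roughly the same size (true for our events).
        channel.basicQos(5000)  // illustrative limit
      }
    }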
  31. Fault Tolerance
      • Created a generic fault tolerance manager (see the sketch below)
      • Used for retrying RabbitMQ consumes and S3 writes
      • Pluggable retry algorithm - linear backoff, exponential backoff, whatever you wish
      • Looked at third-party packages (e.g. Spring Retry) but they didn't quite fit our particular needs
      • RMQ reads rarely fail
      • We do see the occasional S3 write failure
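A minimal sketch of what a generic retry manager with a pluggable backoff algorithm can look like; this is a from-scratch illustration, not Threat Stack's actual manager.

    // Generic retry with a pluggable backoff function (attempt number -> sleep millis).
    import scala.annotation.tailrec
    import scala.util.{Failure, Success, Try}

    object Retry {
      type Backoff = Int => Long

      val linear: Backoff = attempt => attempt * 1000L              // 1s, 2s, 3s, ...
      val exponential: Backoff = attempt => (1L << attempt) * 100L  // 200ms, 400ms, 800ms, ...

      @tailrec
      def retry[T](maxAttempts: Int, backoff: Backoff, attempt: Int = 1)(op: => T): T =
        Try(op) match {
          case Success(result) => result
          case Failure(e) if attempt < maxAttempts =>
            Thread.sleep(backoff(attempt))
            retry(maxAttempts, backoff, attempt + 1)(op)
          case Failure(e) => throw e
        }
    }

    // Usage, e.g. for the occasional S3 write failure:
    //   Retry.retry(5, Retry.exponential) { s3.putObject(bucket, key, stream, meta) }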
  32. Spark and Metrics
      • Metrics and monitoring are vital to Threat Stack
      • Any production app must have a way of allowing for app-specific metrics
      • Spark's custom metrics are very rudimentary
      • Custom metrics capabilities - driver and/or worker?
      • Spark Codahale custom metrics - we apparently have to extend a Spark private class!
      • You need to extend org.apache.spark.metrics.source.Source and include it in your jar (see the sketch below)
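A sketch of the workaround: because Source is private[spark] in Spark 1.x, the custom source has to live under the org.apache.spark package tree and ship in your jar. The source and metric names below are illustrative.

    // Custom Codahale metrics source for Spark 1.x (the package placement is the trick).
    package org.apache.spark.metrics.source

    import com.codahale.metrics.MetricRegistry

    class RollupMetricsSource extends Source {
      override val sourceName: String = "rollups"
      override val metricRegistry: MetricRegistry = new MetricRegistry()

      // Illustrative metrics: batches processed and Postgres write timings.
      val batches = metricRegistry.counter(MetricRegistry.name("batches"))
      val postgresWrites = metricRegistry.timer(MetricRegistry.name("postgresWrites"))
    }

    // On the driver, register it with the (also private[spark]) metrics system:
    //   SparkEnv.get.metricsSystem.registerSource(new RollupMetricsSource())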
