(SEC403) Diving into AWS CloudTrail Events w/ Apache Spark on EMR


Do you want to analyze AWS CloudTrail events within minutes of them arriving in your Amazon S3 bucket? Would you like to learn how to run expressive queries over your CloudTrail logs? We will demonstrate Apache Spark and Apache Spark Streaming as two tools to analyze recent and historical security logs for your accounts. To do so, we will use Amazon Elastic MapReduce (EMR), your logs stored in S3, and Amazon SNS to generate alerts. With these tools at your fingertips, you will be the first to know about security events that require your attention, and you will be able to quickly identify and evaluate the relevant security log entries.

Published in: Technology

  1. SEC403: Timely Security Alerts and Analytics: Diving into AWS CloudTrail Events by Using Apache Spark on Amazon EMR. Will Kruse, AWS IAM Senior Security Engineer. October 2015. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. What to expect from this session. Why are we here? To learn how to: • Audit AWS activity across multiple AWS accounts for compliance and security. • Analyze AWS CloudTrail events as they arrive in your Amazon S3 bucket. • Build profiles of AWS activity for users, origins, etc. • Send alerts when an unexpected or interesting event, or series of events, occurs. • Use Apache Spark, a cutting-edge big data platform, on AWS for security and compliance auditing.
  3. Expected technical background • You are generally familiar with “big data” processing frameworks (e.g., Hadoop). • You are familiar with CloudTrail. • You can read object-oriented code (e.g., Java, Scala, Python, or Ruby). • You are comfortable with a command line.
  4. CloudTrail schema
  5. Demo: SQL queries over CloudTrail
  6. Agenda • SQL queries over CloudTrail logs • Demo using spark-sql + Hive tables • Architecture • Code • Demo of code using Scala • Processing CloudTrail logs as they arrive • Architecture • Demo • Code • Wrap-up
  7. Our architecture [Diagram: CloudTrail objects; Amazon EMR cluster running Apache Spark; security or compliance analyst]
  8. Recipe for SQL queries over CloudTrail logs. Write a Spark application that: 1. “Discovers” CloudTrail logs by calling CloudTrail. • Alternatively, put all your CloudTrail logs in one or more buckets known ahead of time. 2. Creates a list of CloudTrail trails + S3 objects. 3. Loads the data from each S3 object into an RDD. 4. Splits the data into individual CloudTrail event JSON objects. 5. Loads this RDD into a Spark DataFrame. 6. Registers this DataFrame as a table (for querying).
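Step 2 of the recipe can be sketched without Spark. The sketch below assumes the standard CloudTrail key layout (`AWSLogs/<account>/CloudTrail/<region>/<yyyy>/<mm>/<dd>/`) with no extra trail prefix; the function name, account ID, and regions are hypothetical. Each generated prefix could then be passed to an S3 list call to enumerate the log objects for that day.

```python
from datetime import date, timedelta

def cloudtrail_prefixes(account_ids, regions, start, end):
    """Yield one S3 key prefix per day, account, and region in [start, end]."""
    day = start
    while day <= end:
        for account in account_ids:
            for region in regions:
                # Assumed layout; a trail configured with its own key prefix
                # would need that prefix prepended.
                yield f"AWSLogs/{account}/CloudTrail/{region}/{day:%Y/%m/%d}/"
        day += timedelta(days=1)

# Hypothetical account and regions, two days of logs.
prefixes = list(cloudtrail_prefixes(
    ["111111111111"], ["us-east-1", "us-west-2"],
    date(2015, 8, 31), date(2015, 9, 1)))
```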
  9. Introduction to Apache Spark • Big data processing framework • Supported languages: Scala, Python, and Java • Cluster management: Hadoop YARN, Apache Mesos, or standalone • Distributed storage: HDFS, Apache Cassandra, OpenStack Swift, and S3
  10. Why Spark? • Fast • Only does the work it needs to do • Stores final and intermediate results in memory • Supports batch and streaming processing • Supports SQL queries, machine learning (ML), graph data processing, and an R interface • Provides 20+ high-level operators that would otherwise be left as an exercise to the coder • Compatible with much of your existing Hadoop ecosystem
  11. RDDs = Resilient Distributed Datasets [Diagram: CloudTrail objects in S3 (Log #1 … Log #N) → RDD of JSON arrays of CloudTrail events, as strings (Log #1 string … Log #N string) → RDD of CloudTrail events as individual JSON strings (Event #1 … Event #M); operations labeled: parallelize, flatMap]
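The flatMap step in this diagram can be illustrated in plain Python. This is a minimal sketch, not the session's code: it assumes each log string is a JSON document with a top-level `Records` array (the usual CloudTrail log file layout), and the helper name `split_log` and the sample events are made up for illustration.

```python
import json
from itertools import chain

def split_log(log_text):
    """Return each event in one CloudTrail log file as its own JSON string."""
    return [json.dumps(event) for event in json.loads(log_text)["Records"]]

# Two hypothetical log files, already read from S3 into strings.
log_strings = [
    json.dumps({"Records": [{"eventName": "DescribeInstances"},
                            {"eventName": "PutObject"}]}),
    json.dumps({"Records": [{"eventName": "CreateUser"}]}),
]

# Plain-Python analogue of rdd.flatMap(split_log): map each log to a list
# of events, then flatten the lists into one sequence.
event_strings = list(chain.from_iterable(map(split_log, log_strings)))
```

With N log strings in and M event strings out, the result has one element per event rather than one per log file, which is what makes per-event queries possible later.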
  12. DataFrames = Relational table abstraction [Diagram: RDD of CloudTrail events (Event #1 … Event #M) → SQLContext.read.json → table with columns Service, API, Time Stamp, Source IP, Principal. Sample rows, listed as Service, API, Time Stamp, Principal: Event #1: ec2, D…, 2015/08/31 1:10, AIDA1… • Event #2: s3, P…, 2015/08/31 1:11, AIDA2… • Event #3: swf, S…, 2015/08/31 1:12, AROA1… • Event #4: iam, C…, 2015/08/31 1:13, AROA2… • … • Event #M: CloudTrail, D…, 2015/08/31 2:43, AIDA3…]
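To show what "events as a queryable table" buys you, here is a small stand-in that uses Python's built-in `sqlite3` instead of a Spark DataFrame. It is only an analogy for the DataFrame and table registration steps: the table name `cloudtrail`, the column names, and the sample events are assumptions, while the event field names (`eventSource`, `eventName`, `eventTime`, `sourceIPAddress`, `userIdentity.principalId`) follow the CloudTrail record format.

```python
import sqlite3

# Hypothetical events, already parsed from JSON.
events = [
    {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances",
     "eventTime": "2015-08-31T01:10:00Z", "sourceIPAddress": "192.0.2.10",
     "userIdentity": {"principalId": "AIDAEXAMPLE1"}},
    {"eventSource": "s3.amazonaws.com", "eventName": "PutObject",
     "eventTime": "2015-08-31T01:11:00Z", "sourceIPAddress": "192.0.2.10",
     "userIdentity": {"principalId": "AIDAEXAMPLE1"}},
    {"eventSource": "iam.amazonaws.com", "eventName": "CreateUser",
     "eventTime": "2015-08-31T01:13:00Z", "sourceIPAddress": "198.51.100.7",
     "userIdentity": {"principalId": "AROAEXAMPLE1"}},
]

def to_row(event):
    """Flatten one event into (service, api, time, source IP, principal)."""
    return (event["eventSource"].split(".")[0], event["eventName"],
            event["eventTime"], event["sourceIPAddress"],
            event["userIdentity"]["principalId"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cloudtrail (service TEXT, api TEXT, "
             "event_time TEXT, source_ip TEXT, principal TEXT)")
conn.executemany("INSERT INTO cloudtrail VALUES (?, ?, ?, ?, ?)",
                 map(to_row, events))

# Example query: how many calls did each principal make?
calls_per_principal = conn.execute(
    "SELECT principal, COUNT(*) FROM cloudtrail "
    "GROUP BY principal ORDER BY COUNT(*) DESC, principal").fetchall()
```

In Spark the schema would be inferred from the JSON rather than declared by hand, but the resulting queries have the same shape.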
  13. Spark cluster components [Diagram: master node with the application driver; two core nodes, each with two executors holding RDD partitions; tasks (serialized Java/Scala)]
  14. Recommended CloudTrail configuration • Turn on CloudTrail logging in all regions. • Enable S3 bucket logging for all buckets as well. • Get all your CloudTrail logs for all your accounts in one bucket (per region). • Either have CloudTrail deliver them or copy them. • Disallow deletes from CloudTrail buckets.
  15. Needed AWS IAM permissions • Getting started recommendation • Launch an EMR cluster with default roles. • Attach the CloudTrailReadOnly policy to the EMR_EC2_DefaultRole. • Least privilege improvements • Restrict s3:GetObject and s3:ListBucket to CloudTrail buckets. • Remove EMR’s Amazon DynamoDB, Amazon Kinesis, Amazon RDS, Amazon SimpleDB, Amazon SNS, and Amazon SQS permissions.
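The S3 restriction in the least privilege bullet might look like the policy document built below. This is a sketch under assumptions: the bucket name `example-cloudtrail-logs` and the function name are hypothetical, and a real deployment may need additional statements (for example, for the cluster's own log and bootstrap buckets).

```python
import json

def cloudtrail_read_policy(bucket):
    """Build an IAM policy document limited to reading one CloudTrail bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # ListBucket applies to the bucket itself.
            {"Effect": "Allow", "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}"},
            # GetObject applies to the objects inside the bucket.
            {"Effect": "Allow", "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{bucket}/*"},
        ],
    }

# Hypothetical bucket name.
policy_json = json.dumps(cloudtrail_read_policy("example-cloudtrail-logs"),
                         indent=2)
```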
  16. Tour through code to query CloudTrail logs with SQL
  17. Discover CloudTrail data
  18. Transform CloudTrail data
  19. Register CloudTrail data as a table
  20. Demo: Querying CloudTrail logs with Scala prompt
  21. Agenda • SQL queries over CloudTrail logs • Demo using spark-sql • Architecture • Code • Demo of code using Scala • Processing CloudTrail logs as they arrive • Architecture • Demo • Code • Wrap-up
  22. Analytics as soon as CloudTrail data arrives in S3
  23. Introduction to Spark Streaming [Diagram: CloudTrail SNS topic → Spark CloudTrail receiver → executors; inside the Spark application, new activity (Batch N-1 RDD, Batch N RDD) + previous profile = update profile, which goes to a store; alerts go to an alert topic]
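The "previous profile + new activity = updated profile" step can be sketched as a pure function. The profile shape used here (a set of source IPs seen per principal) is an assumption chosen for illustration, as are the function name and sample data; the session's application may track different attributes.

```python
def update_profile(profile, batch):
    """Fold one micro-batch into the profile.

    Returns (updated_profile, alerts), where each alert is a
    (principal, source_ip) pair that was not in the previous profile.
    """
    # Copy so the previous profile is left untouched.
    updated = {principal: set(ips) for principal, ips in profile.items()}
    alerts = []
    for event in batch:
        principal = event["userIdentity"]["principalId"]
        ip = event["sourceIPAddress"]
        known_ips = updated.setdefault(principal, set())
        if ip not in known_ips:
            alerts.append((principal, ip))
            known_ips.add(ip)
    return updated, alerts

# Hypothetical previous profile and batch N.
previous = {"AIDAEXAMPLE1": {"192.0.2.10"}}
batch_n = [
    {"userIdentity": {"principalId": "AIDAEXAMPLE1"},
     "sourceIPAddress": "192.0.2.10"},
    {"userIdentity": {"principalId": "AIDAEXAMPLE1"},
     "sourceIPAddress": "198.51.100.7"},
]
profile, alerts = update_profile(previous, batch_n)
```

Keeping the update as a function of (old profile, batch) makes it easy to apply once per micro-batch and to publish the returned alerts to a notification topic afterwards.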
  24. Spark Streaming and micro-batches [Diagram: a timeline of Event 1 through Event 8 divided into three 3-second intervals; a discretized stream (DStream) holds one RDD per micro-batch: RDD for micro-batch #1, RDD for micro-batch #2, RDD for micro-batch #3]
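How events fall into micro-batches can be shown with a few lines of plain Python. The arrival times below are invented for illustration and are not taken from the slide; the 3-second interval matches the diagram. Unlike a real stream, this sketch simply skips intervals in which nothing arrived.

```python
from collections import defaultdict

def micro_batches(timed_events, batch_interval=3.0):
    """Group (arrival_seconds, event) pairs into fixed-length batches."""
    batches = defaultdict(list)
    for arrival, event in timed_events:
        # Events in [0, 3) go to batch 0, [3, 6) to batch 1, and so on.
        batches[int(arrival // batch_interval)].append(event)
    return [batches[index] for index in sorted(batches)]

# Hypothetical arrival times in seconds since the stream started.
arrivals = [(0.4, "Event 1"), (1.1, "Event 2"), (2.7, "Event 3"),
            (3.2, "Event 4"), (4.9, "Event 5"), (6.0, "Event 6"),
            (7.5, "Event 7"), (8.8, "Event 8")]
batches = micro_batches(arrivals)
```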
  25. Recipe. Write a Spark Streaming application that: 1. Uses a CloudTrail log receiver to learn about new logs from CloudTrail’s SNS feed. • Logs are delivered to S3, usually in less than 15 minutes. 2. Store()s each event from CloudTrail logs. 3. Analyzes events in micro-batches. • Size based on the “batch interval.” 4. Generates alarms on suspicious behavior.
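Step 1 depends on turning a CloudTrail notification into S3 locations. The sketch below assumes the notification arrives as an SNS envelope whose `Message` field is a JSON string containing `s3Bucket` and a list under `s3ObjectKey`; the function name, bucket, and object key are hypothetical. A receiver would fetch each returned object, split it into events, and store each one.

```python
import json

def log_locations(sns_envelope_text):
    """Extract (bucket, key) pairs from one CloudTrail SNS notification."""
    envelope = json.loads(sns_envelope_text)
    # The envelope carries the CloudTrail message as a nested JSON string.
    message = json.loads(envelope["Message"])
    return [(message["s3Bucket"], key) for key in message["s3ObjectKey"]]

# Hypothetical notification, shaped as assumed above.
sample_envelope = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "s3Bucket": "example-cloudtrail-logs",
        "s3ObjectKey": [
            "AWSLogs/111111111111/CloudTrail/us-east-1/2015/08/31/example.json.gz",
        ],
    }),
})
locations = log_locations(sample_envelope)
```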
  26. Scenarios we want to know about ASAP • Connections from unusual geographies • Connections from anonymizing proxies • Use of dormant AWS access keys • Use of dormant AWS principals (users, roles, root)
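The dormant access key scenario could be detected roughly as follows. This is a sketch, not the session's implementation: the 90-day threshold, the function names, and the key IDs are assumptions, and it relies on events carrying `userIdentity.accessKeyId` and an ISO 8601 `eventTime` in UTC.

```python
from datetime import datetime, timedelta

def parse_time(event_time):
    """Parse a CloudTrail eventTime such as 2015-08-31T01:10:00Z."""
    return datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%SZ")

def dormant_key_alerts(last_seen, events, dormant_after=timedelta(days=90)):
    """Alert when a key is used after being idle longer than dormant_after.

    last_seen maps access key ID -> datetime of last use and is updated
    in place. Returns a list of (key_id, previous_use, current_use).
    """
    alerts = []
    for event in sorted(events, key=lambda e: e["eventTime"]):
        key_id = event["userIdentity"].get("accessKeyId")
        if key_id is None:
            continue  # e.g., some console or service events carry no key
        seen_at = parse_time(event["eventTime"])
        previous = last_seen.get(key_id)
        if previous is not None and seen_at - previous > dormant_after:
            alerts.append((key_id, previous, seen_at))
        last_seen[key_id] = seen_at
    return alerts

# Hypothetical profile and new events.
last_seen = {"AKIAEXAMPLEOLD": datetime(2015, 1, 1),
             "AKIAEXAMPLENEW": datetime(2015, 8, 30)}
new_events = [
    {"eventTime": "2015-08-31T01:10:00Z",
     "userIdentity": {"accessKeyId": "AKIAEXAMPLEOLD"}},
    {"eventTime": "2015-08-31T01:11:00Z",
     "userIdentity": {"accessKeyId": "AKIAEXAMPLENEW"}},
]
key_alerts = dormant_key_alerts(last_seen, new_events)
```

The same last-seen pattern would extend to dormant principals by keying the profile on the principal instead of the access key.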
  27. Demo: Streaming analysis of CloudTrail logs
  28. Creating stream of CloudTrail events
  29. Build profiles and send alerts
  30. Agenda • SQL queries over CloudTrail logs • Demo using spark-sql • Architecture • Code • Demo of code using Scala • Processing CloudTrail logs as they arrive • Architecture • Demo • Code • Wrap-up
  31. How to use these tools 1. Build your threat model. 2. Configure and customize this streaming application. 3. Use Spark-on-EMR for ad hoc log analysis. 4. Use Spark Streaming for regular analysis and alerts.
  32. How I use these tools 1. Keep my engineering teams honest. 2. Identify noncompliant usage. 3. Review actors and their actions in my accounts. 4. Craft least privilege policies by analyzing historical usage.
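Item 4 can start from a simple aggregation of which API calls each principal actually made. The sketch below derives candidate IAM action names from `eventSource` and `eventName`; that mapping is an approximation (the event source prefix and event name do not always match the IAM service prefix and action name), so the output is a starting point for review rather than a finished policy. The function name and sample ARNs are hypothetical.

```python
from collections import defaultdict

def observed_actions(events):
    """Map each principal ARN to the sorted set of actions it was seen using."""
    actions = defaultdict(set)
    for event in events:
        principal = event["userIdentity"]["arn"]
        # Approximation: "ec2.amazonaws.com" -> "ec2".
        service = event["eventSource"].split(".")[0]
        actions[principal].add(f"{service}:{event['eventName']}")
    return {principal: sorted(used) for principal, used in actions.items()}

# Hypothetical historical events.
history = [
    {"userIdentity": {"arn": "arn:aws:iam::111111111111:user/alice"},
     "eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
    {"userIdentity": {"arn": "arn:aws:iam::111111111111:user/alice"},
     "eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
    {"userIdentity": {"arn": "arn:aws:iam::111111111111:user/alice"},
     "eventSource": "iam.amazonaws.com", "eventName": "ListUsers"},
]
usage = observed_actions(history)
```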
  33. Take action • See who is active in your AWS accounts and when. • Run queries over your logs in EMR. • Configure and extend the sample application to meet your specific needs. • Find the demo code here:
  34. Remember to complete your evaluations!
  35. Thank you!