
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads with Amazon EBS

ProtectWise shifts network security to the cloud to provide complete visibility, detection of enterprise threats, and incident response. Built entirely on AWS, the ProtectWise grid can offer an unlimited retention window with full-fidelity forensics, automated retrospection, and advanced visualization. To let customers store petabytes of network data and analyze it in seconds, ProtectWise uses Apache Solr and Apache Cassandra to analyze encrypted raw packet data and metadata about network packets, amounting to billions of items per day. Handling that volume of data requires an architecture that is both innovative and cost-effective. In this session, you learn how ProtectWise has optimized its solution on AWS using hot, warm, and cold shards across EC2 instance store, Amazon Elastic Block Store (Amazon EBS), and Amazon S3 for cost and scalability.

  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:Invent. ProtectWise optimizes performance of Cassandra and Kafka workloads with Amazon EBS. Gene Stevens, CTO & Co-Founder, ProtectWise; Robert Tarrall, Director of DevOps, ProtectWise; Andrey Zaychikov, Sr. Solutions Architect, AWS. STG329. November 28, 2017
  2. AGENDA • Intro to NoSQL on AWS • Intro to Apache Cassandra and Apache Kafka on AWS • Best practices for Cassandra and Kafka deployments on AWS • ProtectWise Use Case • Use Case & Optimizations for Kafka • Use Case & Optimizations for Cassandra • Use Case & Optimizations for Amazon S3 • Lessons Learned
  3. NoSQL as a technology
  4. Database per Workload: Pentaho, Talend, Vertica, Aerospike, Cassandra, MongoDB
  5. Database options on AWS
  6. Data movement: Online, Offline. Data security and management. Complete set of data building blocks: Amazon EFS, Amazon EBS, AWS Snow family, AWS Storage Gateway Family, AWS Direct Connect, Amazon EFS File Sync, Amazon S3 Transfer Acceleration, Storage Partners, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams, Amazon S3, Amazon Glacier, AWS KMS, AWS IAM, Amazon CloudWatch, AWS CloudTrail, AWS CloudFormation, AWS Lambda, Amazon Macie, Amazon QuickSight
  7. What is Cassandra and why use it? • Apache Cassandra is an open-source database based on the Dynamo model • It is a massively scalable, geo-distributed, high-performance key-value database • Common use cases for Cassandra include: • Time-series data • Social media • User sessions (shopping carts, etc.)
  8. How Cassandra works on a cluster level
  9. How Cassandra works on a node level
  10. Application interactions with Cassandra
  11. Best practices for Cassandra on AWS
  12. What is Apache Kafka and why use it? • Apache Kafka is an open-source distributed streaming platform • It allows you to: • Publish and subscribe to streams of records • Store streams in a fault-tolerant way • Process streams of records • Most common use cases for Kafka are: • Build data pipelines to capture and transfer data between systems & applications • Build real-time apps that react to streams of data • Kafka is often used to capture fast-arriving data and is placed in front of a database such as Cassandra. Such a setup can reduce pressure on the database.
  13. How Kafka works
  14. Best practices for Kafka on AWS
  15. Choosing proper instance and storage types. Database implementation, data schema, and access patterns should always be considered. Compute and storage types should always be adapted to the particular situation and can change during the DB lifetime. Cassandra Kafka
  16. CASE STUDY: PROTECTWISE GENE STEVENS, CTO & CO-FOUNDER, PROTECTWISE ROBERT TARRALL, DIRECTOR OF DEVOPS, PROTECTWISE
  17. PROTECTWISE OVERVIEW
  19. GOALS • Very low end-to-end latency (~1 second) • Very high availability • Over 1 billion writes per hour • High tolerance for bursts (10x-100x normal volume) • Trillions of records per year • Less than 10-second response time to searches • Arbitrary queries: “all non-HTTP traffic on port 80”
  20. INITIAL ARCHITECTURE
  21. SOLUTION: Amazon S3, Kafka, Amazon EBS
  22. SOLUTION: Amazon S3, Kafka, Amazon EBS
  23. SOLUTION: Amazon S3, Kafka, Amazon EBS
  24. KAFKA Our Kafka clusters: • 1,000 topics • Up to 200 partitions per topic • 45 c4.2xlarge • 2x 1 TB gp2 EBS volumes • Peak consumption > 100 MB/sec/server
  25. KAFKA: WINS • Retention: 24 hours of “buffer” • Pub/sub with “at least once” guarantee • Fanout means we can test in production: • New engines publish to “profiling” topic, confirm useful detections • Significant code changes can be performance-tested
  26. KAFKA: NOTES • Partition is your fundamental unit of scaling – use lots of partitions • Use round robin partition assignment, not range • Be sure to test “edge” cases: recovery times, backlog • As broker recovers each partition, consumers rebalance • “At least once” = “sometimes more than once”
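The last note on this slide (“at least once” = “sometimes more than once”) implies that consumers should tolerate redelivered records. One common way to do that is to make processing idempotent by tracking a unique record key. The sketch below is only an illustration under that assumption: the `record_id` field and the sample payloads are hypothetical, and the talk does not say how ProtectWise handles duplicates.

```python
from typing import Iterable, Iterator, Tuple

# (record_id, payload); record_id is a hypothetical unique key per record.
Record = Tuple[str, str]


def deduplicate(records: Iterable[Record]) -> Iterator[Record]:
    """Yield each record once, skipping redelivered copies by record_id."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue  # duplicate delivery: this record was already processed
        seen.add(record_id)
        yield record_id, payload


# A delivery stream in which record "b" arrives twice, as can happen when a
# consumer restarts before committing its offset.
delivered = [("a", "flow-1"), ("b", "flow-2"), ("b", "flow-2"), ("c", "flow-3")]
processed = list(deduplicate(delivered))
print(processed)
```

An in-memory set only works for a bounded window of keys; at the volumes described in this talk, a real deployment would need a bounded or persistent structure instead.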
  27. KAFKA: WARNINGS/CAVEATS • Beware cross-AZ replication costs! • Kafka has only limited “rack awareness” • Producers and consumers talk to the “leader” of a partition • With RF=2, data may cross AZs 3 (or more) times
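The “3 (or more) times” figure can be reproduced by counting the network hops a record makes: producer to partition leader, leader to follower, and leader to each consumer. The following is a minimal counting model, not Kafka code; the function name and the AZ labels are hypothetical.

```python
from typing import List


def cross_az_hops(producer_az: str, leader_az: str,
                  follower_azs: List[str], consumer_azs: List[str]) -> int:
    """Count how many times one record crosses an AZ boundary."""
    hops = 0
    if producer_az != leader_az:
        hops += 1  # producer writes to a leader in another AZ
    # Replication: the leader ships the record to each follower replica.
    hops += sum(1 for az in follower_azs if az != leader_az)
    # Consumption: every consumer reads from the partition leader.
    hops += sum(1 for az in consumer_azs if az != leader_az)
    return hops


# RF=2 means one leader plus one follower.
worst = cross_az_hops("az-a", "az-b", ["az-c"], ["az-a"])
best = cross_az_hops("az-a", "az-a", ["az-b"], ["az-a"])
fanout = cross_az_hops("az-a", "az-b", ["az-c"], ["az-a", "az-c"])
print(worst, best, fanout)  # 3 1 4
```

Each additional consumer group that reads the same topic from another AZ adds one more crossing, which is where the “or more” comes from when fanout is used.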
  28. KAFKA – Cross-AZ Traffic
  29. KAFKA – Cross-AZ Traffic
  30. KAFKA: WARNINGS/CAVEATS • Monthly costs in perspective: • Instances: $8,000 • gp2 EBS volumes: $8,000 • Network traffic: $40,000
  31. KAFKA: WARNINGS/CAVEATS • Single broker failure impacts the whole cluster • “Let’s bump that timeout” often has unexpected consequences • Mostly trust default settings (but use round-robin!)
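The round-robin recommendation is easier to see with a small model of the two strategies. Range assignment splits each topic separately, so the consumers that sort first pick up the remainder of every topic; round robin spreads all topic-partitions evenly. This is a simplified Python model of that behavior, assuming every consumer subscribes to every topic. It is not Kafka's implementation, and the topic and consumer names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]
Assignment = Dict[str, List[TopicPartition]]


def range_assign(topics: Dict[str, int], consumers: List[str]) -> Assignment:
    """Per topic, hand out contiguous ranges; early consumers get the remainder."""
    assignment = defaultdict(list)
    members = sorted(consumers)
    for topic in sorted(topics):
        per_consumer, extra = divmod(topics[topic], len(members))
        start = 0
        for i, member in enumerate(members):
            size = per_consumer + (1 if i < extra else 0)
            assignment[member].extend((topic, p) for p in range(start, start + size))
            start += size
    return dict(assignment)


def round_robin_assign(topics: Dict[str, int], consumers: List[str]) -> Assignment:
    """Lay out every topic-partition in order and deal them out one at a time."""
    assignment = defaultdict(list)
    members = sorted(consumers)
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for i, tp in enumerate(all_partitions):
        assignment[members[i % len(members)]].append(tp)
    return dict(assignment)


# Four topics with three partitions each, shared by two consumers.
topics = {f"topic-{n}": 3 for n in range(4)}
consumers = ["consumer-a", "consumer-b"]
range_load = {c: len(tps) for c, tps in range_assign(topics, consumers).items()}
rr_load = {c: len(tps) for c, tps in round_robin_assign(topics, consumers).items()}
print(range_load)  # {'consumer-a': 8, 'consumer-b': 4}
print(rr_load)     # {'consumer-a': 6, 'consumer-b': 6}
```

With many topics, as in the cluster described earlier, the per-topic remainder under range assignment accumulates on the same consumers, which is the imbalance round robin avoids.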
  32. DATABASE • Sustain 250K writes/sec (bursts > 1 million/sec) • 1 year of data • Support arbitrary queries • < 10-second response time for search
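As a quick sanity check, the sustained write rate on this slide can be converted to hourly and yearly totals and compared with the goals stated earlier (on the order of a billion writes per hour, trillions of records per year). The constant names are illustrative, and a 365-day year is assumed.

```python
SUSTAINED_WRITES_PER_SEC = 250_000   # from the slide
BURST_WRITES_PER_SEC = 1_000_000     # lower bound of the burst rate on the slide
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_YEAR = 365 * 24 * SECONDS_PER_HOUR  # assumes a 365-day year

writes_per_hour = SUSTAINED_WRITES_PER_SEC * SECONDS_PER_HOUR
writes_per_year = SUSTAINED_WRITES_PER_SEC * SECONDS_PER_YEAR
burst_ratio = BURST_WRITES_PER_SEC / SUSTAINED_WRITES_PER_SEC

print(f"{writes_per_hour:,} writes/hour")   # 900,000,000 writes/hour
print(f"{writes_per_year:,} writes/year")   # 7,884,000,000,000 writes/year
print(f"burst is at least {burst_ratio:.0f}x sustained")
```

So 250K writes/sec sustained is roughly 0.9 billion writes per hour and about 7.9 trillion records over a year of retention.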
  33. DATABASE (v1) • We use DataStax Enterprise Search • Supports over 1 TB of data + index on an i2.2xlarge • Handled the load, but one month of data = 100x i2.2xlarge • We keep a year of data… • Sharded by time, migrated older data to r4.2xlarge with gp2 EBS volumes
  34. DATABASE (v1) – Lessons Learned • Use DSE 5.0 or later – much better indexing throughput • Don’t use vnodes • Do use large heap (20-30 GB is fine) and G1GC • Beware outdated blog posts! (Amazon EBS has come a LONG way) • High write throughput + search leads to high operational burden • Use Amazon EBS if at all possible
  35. Amazon EBS! Migrating data to Amazon EBS taught us that we really want to use Amazon EBS: • Instances without ephemeral storage fail less frequently • Amazon EBS volumes do fail, but very rarely • Decoupling state from compute is a huge win: • Need more CPU in your Cassandra cluster? Stop one AZ, change instance type, start; repeat for all AZs • Modify Amazon EBS volume to expand storage
  36. Amazon S3 for Full-text Search Today’s architecture (write path): • Data written to Cassandra with a TTL • Once final (a few hours), a Spark job on Amazon EMR: • Reads data from C* • Writes Parquet files to Amazon S3 • Writes Bloom filters to Solr
  37. Amazon S3 for Full-text Search Read path: • Very recent data answered from C*/Solr • Bloom filters tell us which Parquet files to lift from Amazon S3 for older data • Spark on Amazon EMR reads the Parquet files – highly parallel
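The read path relies on one Bloom filter per Parquet file to decide which files are worth fetching from Amazon S3. A Bloom filter never gives a false negative, so a file whose filter rejects a value can be skipped safely. The sketch below shows the general idea with one filter per file; the class, the sizing parameters, the file names, and the sample values are all hypothetical, and it does not model how ProtectWise stores its filters in Solr.

```python
import hashlib
from typing import Dict, Iterable, List


class BloomFilter:
    """Small Bloom filter that uses a Python int as its bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str) -> Iterable[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def candidate_files(filters: Dict[str, BloomFilter], value: str) -> List[str]:
    """Return the files whose filter cannot rule out the value."""
    return [path for path, bf in filters.items() if bf.might_contain(value)]


# Hypothetical Parquet files and the values each one contains.
files = {
    "part-0001.parquet": ["10.0.0.1", "10.0.0.2"],
    "part-0002.parquet": ["10.0.0.3"],
}
filters: Dict[str, BloomFilter] = {}
for path, values in files.items():
    bf = BloomFilter()
    for value in values:
        bf.add(value)
    filters[path] = bf

hits = candidate_files(filters, "10.0.0.3")
print(hits)
```

False positives are possible, so a candidate file still has to be read to confirm a match; the filter only reduces how many files are fetched.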
  38. LESSONS LEARNED - KAFKA Overall, Kafka is a very valuable part of our platform and works great on EBS. If expecting massive scale, keep the following in mind: • Cross-AZ replication cost adds up. 2 GB/sec for 1 month is 5 petabytes. • A single broker can cause availability problems (not data loss) for the whole cluster. • Small clusters are very easy to operate; larger clusters have more issues and higher mean time to recovery.
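The “2 GB/sec for 1 month is 5 petabytes” figure follows directly from the arithmetic, assuming a 30-day month and decimal units (1 PB = 1,000,000 GB):

```python
RATE_GB_PER_SEC = 2          # from the slide
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30          # assumption: a 30-day month
GB_PER_PB = 1_000_000        # decimal units

total_gb = RATE_GB_PER_SEC * SECONDS_PER_DAY * DAYS_PER_MONTH
total_pb = total_gb / GB_PER_PB

print(f"{total_gb:,} GB = {total_pb:.3f} PB")  # 5,184,000 GB = 5.184 PB
```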
  39. LESSONS LEARNED - CASSANDRA • Cassandra and Amazon EBS have come a long way in a short time • Ignore most of what was written before 2016! • 2014: “unless you want to add more complexity for your operations team… choose ephemeral” (DataStax blog) • Today: “gp2 volumes… best choice for most workloads”
  40. LESSONS LEARNED - CASSANDRA • Cassandra is REALLY good at handling bursts • Take the time to run benchmarks matching your expected workload: • Run long enough to reach “steady state” (hours to days) • Object sizes, read/write ratios, key distribution • Compaction strategy • Watch for: • Pending compactions • Blocked native transport requests
  41. LESSONS LEARNED – Amazon EBS & Amazon S3 • Higher latency (vs. ephemeral) doesn’t mean lower throughput! • Mitigate latency impact by increasing parallelism • Major operational wins: • Much higher reliability (both storage and compute) • Decoupling state from compute allows each to be independently adjusted
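The point that higher latency need not mean lower throughput follows from Little's law: throughput equals the number of requests in flight divided by the latency of each request. The sketch below works through that relationship. The latency figures are hypothetical round numbers chosen for illustration, not measurements of instance store or Amazon EBS.

```python
import math


def throughput_ops_per_sec(in_flight: int, latency_us: int) -> float:
    """Little's law: throughput = concurrency / latency."""
    return in_flight * 1_000_000 / latency_us


def in_flight_needed(target_ops_per_sec: float, latency_us: int) -> int:
    """Concurrency required to sustain a target throughput at a given latency."""
    return math.ceil(target_ops_per_sec * latency_us / 1_000_000)


LOW_LATENCY_US = 200     # hypothetical per-request latency on local storage
HIGH_LATENCY_US = 1_000  # hypothetical per-request latency on network storage

baseline = throughput_ops_per_sec(in_flight=8, latency_us=LOW_LATENCY_US)
needed = in_flight_needed(baseline, HIGH_LATENCY_US)

print(f"baseline: {baseline:,.0f} ops/sec with 8 requests in flight")
print(f"same throughput at 5x the latency needs {needed} requests in flight")
```

In this model, a 5x increase in latency is offset by a 5x increase in parallelism, provided the volume and instance limits are not reached first.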
  42. LESSONS LEARNED – Amazon S3 Planning for high Amazon S3 request rate: • Add random prefix to avoid hotspots: • s3://bucket/ApD4J. <object_name> • If you have sufficient randomness, you’re not going to run into Amazon S3 limits… • We’ve had over 1 billion objects and 5 petabytes in a bucket • We made 1600 API calls/sec against that bucket for 2 weeks on top of regular production workload with zero impact
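One way to get a well-distributed key prefix is to derive it from a hash of the object name, which keeps the prefix reproducible so the key can be recomputed later. This swaps a hash-derived prefix in for a truly random one, and the helper name, the five-character prefix length, the `.` separator, and the sample object name are all assumptions made for illustration, not the scheme ProtectWise uses.

```python
import hashlib


def prefixed_key(object_name: str, prefix_len: int = 5) -> str:
    """Prepend a short hash-derived prefix so keys spread across the keyspace."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}.{object_name}"


# Hypothetical object name.
name = "capture-0001.pcap"
key = prefixed_key(name)
print(f"s3://bucket/{key}")
```

Because the prefix is a pure function of the object name, a reader that knows the name can rebuild the full key without a lookup table.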
  43. Thank you!
