1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
ProtectWise optimizes performance of
Cassandra and Kafka workloads with
Amazon EBS
Gene Stevens, CTO & Co-Founder, ProtectWise
Robert Tarrall, Director of DevOps, ProtectWise
Andrey Zaychikov, Sr. Solutions Architect, AWS
STG329
November 28, 2017
2.
AGENDA
• Intro to NoSQL on AWS
• Intro to Apache Cassandra and Apache Kafka on AWS
• Best practices for Cassandra and Kafka deployments on AWS
• ProtectWise Use Case
• Use Case & Optimizations for Kafka
• Use Case & Optimizations for Cassandra
• Use Case & Optimizations for Amazon S3
• Lessons Learned
3.
NoSQL as a technology
4.
Database per Workload
Pentaho
Talend
Vertica
Aerospike
Cassandra
MongoDB
5.
Database options on AWS
6.
Complete set of data building blocks
• Amazon EFS, Amazon EBS, Amazon S3, Amazon Glacier
• Data movement (online and offline): AWS Snow family, AWS Storage Gateway family, AWS Direct Connect, Amazon EFS File Sync, Amazon S3 Transfer Acceleration, Storage Partners, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
• Data security and management: AWS KMS, AWS IAM, Amazon CloudWatch, AWS CloudTrail, AWS CloudFormation, AWS Lambda, Amazon Macie, Amazon QuickSight
7.
What is Cassandra and why use it?
• Apache Cassandra is an open-source database based on the Dynamo model
• It is a massively scalable, geo-distributed, high-performance key-value database
• Common use cases for Cassandra include:
  • Time-series data
  • Social media
  • User sessions (shopping carts, etc.)
8.
How Cassandra works on a cluster level
9.
How Cassandra works on a node level
10.
Application interactions with Cassandra
11.
Best practices for Cassandra on AWS
12.
What is Apache Kafka and why use it?
• Apache Kafka is an open-source distributed streaming platform
• It allows you to:
  • Publish and subscribe to streams of records
  • Store streams in a fault-tolerant way
  • Process streams of records
• The most common use cases for Kafka are:
  • Building data pipelines that capture and transfer data between systems and applications
  • Building real-time apps that react to streams of data
• Kafka is often used to capture fast-arriving data in front of a database such as Cassandra. Such a setup can reduce pressure on the database.
13.
How Kafka works
14.
Best practices for Kafka on AWS
15.
Choosing proper instance and storage types
Always consider the database implementation, data schema, and access patterns.
Compute and storage types should be adapted to the particular situation and can
change during the lifetime of the database.
Cassandra Kafka
16.
CASE STUDY: PROTECTWISE
GENE STEVENS, CTO & CO-FOUNDER, PROTECTWISE
ROBERT TARRALL, DIRECTOR OF DEVOPS, PROTECTWISE
17.
PROTECTWISE OVERVIEW
18.
19.
GOALS
• Very low end-to-end latency (~1 second)
• Very high availability
• Over 1 billion writes per hour
• High tolerance for bursts (10x-100x normal volume)
• Trillions of records per year
• Less than 10-second response time to searches
• Arbitrary queries: “all non-HTTP traffic on port 80”
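The write goals can be sanity-checked with a little arithmetic. The sketch below only converts the stated hourly figure into a per-second average and a yearly total; it assumes a constant rate, which real traffic will not have.

```python
# Back-of-the-envelope rates implied by the stated goals.
WRITES_PER_HOUR = 1_000_000_000   # "over 1 billion writes per hour"
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365

# Average rate if the hourly volume were spread evenly (an assumption).
avg_writes_per_sec = WRITES_PER_HOUR / SECONDS_PER_HOUR

# Yearly total at that same constant rate.
writes_per_year = WRITES_PER_HOUR * HOURS_PER_YEAR

print(f"average: {avg_writes_per_sec:,.0f} writes/sec")
print(f"per year: {writes_per_year:.2e} records")
```

At roughly 278 thousand writes per second, a year of data lands in the trillions of records, which is consistent with the goal listed above.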
20.
INITIAL ARCHITECTURE
21.
SOLUTION: Amazon S3, Kafka, Amazon EBS
22.
SOLUTION: Amazon S3, Kafka, Amazon EBS
23.
SOLUTION: Amazon S3, Kafka, Amazon EBS
24.
KAFKA
Our Kafka clusters:
• 1,000 topics
• Up to 200 partitions per topic
• 45 c4.2xlarge
• 2x 1 TB gp2 EBS volumes
• Peak consumption > 100 MB/sec/server
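A rough sizing sketch for the cluster above. It assumes the two 1 TB volumes are per broker, a replication factor of 2, and a 24-hour retention window (the last two appear elsewhere in this deck); it uses decimal units and ignores filesystem overhead and free-space headroom, so treat the results as upper bounds.

```python
BROKERS = 45
VOLUMES_PER_BROKER = 2            # assumption: "2x 1 TB" is per broker
VOLUME_TB = 1
PEAK_MB_PER_SEC_PER_BROKER = 100  # slide says "> 100", so a lower bound
RETENTION_HOURS = 24              # assumption: retention window from this deck
REPLICATION_FACTOR = 2            # assumption: RF=2 as mentioned in this deck

# Raw disk across the cluster.
raw_tb = BROKERS * VOLUMES_PER_BROKER * VOLUME_TB

# Aggregate consumer read rate at peak (lower bound).
peak_cluster_gb_per_sec = BROKERS * PEAK_MB_PER_SEC_PER_BROKER / 1000

# Highest sustained ingest that still fits the retention window on disk.
max_ingest_mb_per_sec = (
    raw_tb * 1_000_000 / REPLICATION_FACTOR / (RETENTION_HOURS * 3600)
)

print(raw_tb, peak_cluster_gb_per_sec, round(max_ingest_mb_per_sec))
```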
25.
KAFKA: WINS
• Retention: 24 hours of “buffer”
• Pub/sub with “at least once” guarantee
• Fanout means we can test in production:
  • New engines publish to “profiling” topic, confirm useful detections
  • Significant code changes can be performance-tested
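An "at least once" guarantee means a consumer may see the same record twice, for example after a rebalance. One common way to cope is to make the consumer idempotent. The sketch below is not ProtectWise's implementation; it assumes every message carries a unique ID (hypothetical) and remembers a bounded window of recent IDs.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent `capacity` message IDs and flags repeats."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, message_id):
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep hot IDs in the window
            return True
        self._seen[message_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)      # evict the oldest ID
        return False


def process(messages, handler, dedup):
    """Call `handler` once per distinct message ID, skipping redeliveries."""
    for message_id, payload in messages:
        if not dedup.is_duplicate(message_id):
            handler(payload)


# Simulated delivery with two redelivered messages (IDs 2 and 1).
delivered = [(1, "a"), (2, "b"), (2, "b"), (3, "c"), (1, "a")]
handled = []
process(delivered, handled.append, Deduplicator())
print(handled)
```

A bounded window only catches duplicates that arrive close together; writes that are naturally idempotent (such as upserts keyed by record ID) avoid the need for it.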
26.
KAFKA: NOTES
• The partition is your fundamental unit of scaling; use lots of partitions
• Use round-robin partition assignment, not range
• Be sure to test “edge” cases: recovery times, backlog
  • As a broker recovers each partition, consumers rebalance
• “At least once” = “sometimes more than once”
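Why round robin rather than range? The simplified simulation below mimics the two strategies for consumers that all subscribe to the same topics. Range assignment splits each topic separately, so with many small topics the first consumers in sort order collect the extras; round robin deals all topic-partitions out in turn. The topic and consumer names are made up for illustration.

```python
from itertools import cycle


def range_assign(topics, consumers):
    """Per topic, hand each consumer a contiguous block of partitions."""
    assignment = {c: [] for c in consumers}
    ordered = sorted(consumers)
    for topic in sorted(topics):
        per_consumer, extra = divmod(topics[topic], len(ordered))
        start = 0
        for i, consumer in enumerate(ordered):
            length = per_consumer + (1 if i < extra else 0)
            assignment[consumer].extend(
                (topic, p) for p in range(start, start + length)
            )
            start += length
    return assignment


def round_robin_assign(topics, consumers):
    """Deal every topic-partition out to the consumers in turn."""
    assignment = {c: [] for c in consumers}
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for topic_partition, consumer in zip(all_partitions, cycle(sorted(consumers))):
        assignment[consumer].append(topic_partition)
    return assignment


# Ten topics with three partitions each, four consumers (hypothetical).
topics = {f"topic-{i:02d}": 3 for i in range(10)}
consumers = ["c0", "c1", "c2", "c3"]

range_load = {c: len(tps) for c, tps in range_assign(topics, consumers).items()}
rr_load = {c: len(tps) for c, tps in round_robin_assign(topics, consumers).items()}
print("range:      ", range_load)
print("round robin:", rr_load)
```

In the Java consumer this choice is made with the `partition.assignment.strategy` setting, for example `org.apache.kafka.clients.consumer.RoundRobinAssignor`.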
27.
KAFKA: WARNINGS/CAVEATS
• Beware cross-AZ replication costs!
• Kafka has only limited “rack awareness”
• Producers and consumers talk to the “leader” of a partition
• With RF=2, data may cross AZs 3 (or more) times
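The "3 (or more) times" figure can be reproduced by counting hops. The sketch below is a simplified model, not a measurement: it assumes each follower replica lives in a different AZ than the leader, and for the expected value that producers, consumers, and leaders are spread uniformly across AZs.

```python
def worst_case_az_crossings(replication_factor, consumer_groups):
    """Every hop crosses an AZ boundary."""
    producer_hop = 1                           # producer -> leader
    replication_hops = replication_factor - 1  # leader -> each follower
    consumer_hops = consumer_groups            # leader -> one reader per group
    return producer_hop + replication_hops + consumer_hops


def expected_az_crossings(replication_factor, consumer_groups, az_count):
    """Average crossings when clients are uniformly spread (assumption)."""
    p_remote = (az_count - 1) / az_count       # chance a client is in another AZ
    return p_remote + (replication_factor - 1) + consumer_groups * p_remote


print(worst_case_az_crossings(replication_factor=2, consumer_groups=1))
print(expected_az_crossings(replication_factor=2, consumer_groups=1, az_count=3))
```

Each additional consumer group reading the same topic adds another potential crossing, which is where "or more" comes from.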
28.
KAFKA – Cross-AZ Traffic
29.
KAFKA – Cross-AZ Traffic
30.
KAFKA: WARNINGS/CAVEATS
• Monthly costs in perspective:
• Instances: $8,000
• gp2 EBS volumes: $8,000
• Network traffic: $40,000
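To put those three numbers side by side, this small calculation derives the monthly total and each item's share from the figures on the slide.

```python
# Monthly Kafka costs as listed on the slide (USD).
monthly_costs = {
    "instances": 8_000,
    "gp2 EBS volumes": 8_000,
    "network traffic": 40_000,
}

total = sum(monthly_costs.values())
shares = {item: cost / total for item, cost in monthly_costs.items()}

for item, share in shares.items():
    print(f"{item}: {share:.0%} of ${total:,}")
```

Network traffic comes to about 71% of the total, more than twice compute and storage combined.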
31.
KAFKA: WARNINGS/CAVEATS
• A single broker failure impacts the whole cluster
• “Let’s bump that timeout” often has unexpected consequences
• Mostly trust default settings (but use round-robin!)
32.
DATABASE
• Sustain 250K writes/sec (bursts > 1 million/sec)
• 1 year of data
• Support arbitrary queries
• < 10-second response time for search
33.
DATABASE (v1)
• We use DataStax Enterprise Search
• Supports over 1 TB of data + index on an i2.2xlarge
• Handled the load, but one month of data requires 100 i2.2xlarge instances
• We keep a year of data…
• Sharded by time, migrated older data to r4.2xlarge with gp2 EBS volumes
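A sketch of what sharding by time with a hot and a warm tier could look like. The deck does not state the shard granularity, the naming, or the cutoff age, so the monthly shards, the `events_YYYY_MM` names, and the one-month hot window below are all assumptions for illustration.

```python
from datetime import datetime

NODES_PER_MONTH = 100   # from the slide: one month of data = 100 i2.2xlarge
MONTHS_RETAINED = 12    # "We keep a year of data"

# Fleet size if every month stayed on the i2.2xlarge tier.
nodes_for_full_year = NODES_PER_MONTH * MONTHS_RETAINED


def shard_for(ts):
    """Name of the monthly shard holding a record (hypothetical naming)."""
    return f"events_{ts.year:04d}_{ts.month:02d}"


def tier_for(ts, now, hot_months=1):
    """Pick a storage tier by age in calendar months (assumed cutoff)."""
    age_months = (now.year - ts.year) * 12 + (now.month - ts.month)
    if age_months < hot_months:
        return "hot: i2.2xlarge, ephemeral SSD"
    return "warm: r4.2xlarge, gp2 EBS"


now = datetime(2017, 11, 28)
print(nodes_for_full_year)
print(shard_for(now), "->", tier_for(now, now))
print(shard_for(datetime(2017, 6, 15)), "->", tier_for(datetime(2017, 6, 15), now))
```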
34.
DATABASE (v1) – Lessons Learned
• Use DSE 5.0 or later – much better indexing throughput
• Don’t use vnodes
• Do use a large heap (20-30 GB is fine) and G1GC
• Beware outdated blog posts! (Amazon EBS has come a LONG way)
• High write throughput + search leads to a high operational burden
• Use Amazon EBS if at all possible
Amazon EBS!
Migrating data to Amazon EBS taught us that we really want to use Amazon EBS:
• Instances without ephemeral storage fail less frequently
• Amazon EBS volumes do fail, but very rarely
• Decoupling state from compute is a huge win:
  • Need more CPU in your Cassandra cluster? Stop one AZ, change instance type, start; repeat for all AZs
  • Modify the Amazon EBS volume to expand storage
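The "stop one AZ, change instance type, start" procedure can be scripted. The sketch below expects an object that behaves like `boto3.client("ec2")` and uses its `stop_instances`, `modify_instance_attribute`, `start_instances`, `modify_volume`, and waiter calls. The `RecordingEC2` class is a dry-run stand-in added here so the call order can be inspected without touching AWS; the instance and volume IDs are placeholders. Draining Cassandra before the stop and growing the filesystem after the volume change are left out.

```python
def resize_az(ec2, instance_ids, new_type):
    """Stop, retype, and restart the EBS-backed instances of one AZ."""
    ec2.stop_instances(InstanceIds=instance_ids)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=instance_ids)
    for instance_id in instance_ids:
        ec2.modify_instance_attribute(
            InstanceId=instance_id, InstanceType={"Value": new_type}
        )
    ec2.start_instances(InstanceIds=instance_ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)


def rolling_resize(ec2, instances_by_az, new_type):
    """One AZ at a time, so replicas in the other AZs keep serving."""
    for az in sorted(instances_by_az):
        resize_az(ec2, instances_by_az[az], new_type)


def expand_volume(ec2, volume_id, new_size_gib):
    """Grow an EBS volume in place (the filesystem is grown separately)."""
    ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib)


class RecordingEC2:
    """Dry-run stand-in: records method names instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args, **kwargs):
            self.calls.append(name)
            return self  # lets get_waiter(...).wait(...) chain
        return record


dry_run = RecordingEC2()
rolling_resize(
    dry_run,
    {"us-east-1a": ["i-aaaa"], "us-east-1b": ["i-bbbb"]},  # placeholder IDs
    "r4.4xlarge",
)
expand_volume(dry_run, "vol-0123", 2000)                    # placeholder ID
print(dry_run.calls)
```

Against a real account, pass `boto3.client("ec2")` in place of `RecordingEC2()`. Whether a whole AZ can be stopped at once depends on the replication strategy and consistency levels in use.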
36.
Amazon S3 for Full-text Search
Today’s architecture (write path):
• Data written to Cassandra with a TTL
• Once the data is final (after a few hours), a Spark job on Amazon EMR:
  • Reads data from C*
  • Writes Parquet files to Amazon S3
  • Writes Bloom filters to Solr
37.
Amazon S3 for Full-text Search
Read path:
• Very recent data answered from C*/Solr
• Bloom filters tell us which Parquet files to fetch from Amazon S3 for older data
• Spark on Amazon EMR reads the Parquet files – highly parallel
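How a Bloom filter narrows down the Parquet files to read can be shown with a toy version. This is a minimal sketch, not the production design: one small filter per file, SHA-256-derived bit positions, and made-up bucket, file names, and values. A Bloom filter never misses a value that was added, but it may occasionally report a file that does not contain the value.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def candidate_files(filters, term):
    """Files whose filter says the term may be present."""
    return [path for path, bf in filters.items() if bf.might_contain(term)]


# Hypothetical index: which values ended up in which Parquet file.
files = {
    "s3://example-bucket/part-0001.parquet": ["10.0.0.1", "10.0.0.2"],
    "s3://example-bucket/part-0002.parquet": ["10.0.0.3"],
}
filters = {}
for path, values in files.items():
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    filters[path] = bloom

hits = candidate_files(filters, "10.0.0.3")
print(hits)
```

Only the files in `hits` need to be read from Amazon S3; false positives cost an extra file read, never a missed result.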
38.
LESSONS LEARNED - KAFKA
Overall, Kafka is a very valuable part of our platform and works great
on EBS.
If you expect massive scale, keep the following in mind:
• Cross-AZ replication cost adds up. 2 GB/sec for 1 month is 5 petabytes.
• A single broker can cause availability problems (not data loss) for the whole cluster.
• Small clusters are very easy to operate; larger clusters have more issues and higher mean time to recovery.
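The 5-petabyte figure follows directly from the rate. The sketch below uses a 30-day month and decimal units. The per-GB price passed to `transfer_cost` is a placeholder for illustration, not a quoted AWS rate.

```python
GB_PER_SEC = 2
SECONDS_PER_MONTH = 30 * 24 * 3600   # 30-day month

total_gb = GB_PER_SEC * SECONDS_PER_MONTH
total_pb = total_gb / 1_000_000      # decimal petabytes


def transfer_cost(gigabytes, price_per_gb):
    """Cost of moving `gigabytes` at a caller-supplied per-GB price."""
    return gigabytes * price_per_gb


print(f"{total_gb:,} GB = {total_pb:.2f} PB per month")
print(f"at a placeholder $0.01/GB: ${transfer_cost(total_gb, 0.01):,.0f}")
```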
39.
LESSONS LEARNED - CASSANDRA
• Cassandra and Amazon EBS have come a long way in a short time
• Ignore most of what was written before 2016!
  • 2014: “unless you want to add more complexity for your operations team… choose ephemeral” (DataStax blog)
  • Today: “gp2 volumes… best choice for most workloads”
40.
LESSONS LEARNED - CASSANDRA
• Cassandra is REALLY good at handling bursts
• Take the time to run benchmarks matching your expected workload:
  • Run long enough to reach “steady state” (hours to days)
  • Object sizes, read/write ratios, key distribution
  • Compaction strategy
• Watch for:
  • Pending compactions
  • Blocked native transport requests
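One way to decide whether a benchmark has run long enough is to check that recent throughput samples have stopped moving. The check below is a hypothetical heuristic: the window size and the 5% coefficient-of-variation threshold are arbitrary choices for this sketch, and the sample numbers are invented.

```python
from statistics import mean, pstdev


def reached_steady_state(samples, window=6, max_cv=0.05):
    """True if the last `window` samples vary little relative to their mean."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    average = mean(recent)
    if average == 0:
        return False
    return pstdev(recent) / average <= max_cv


# Invented writes/sec samples, one per measurement interval.
warming_up = [90_000, 140_000, 200_000, 180_000, 150_000, 120_000]
settled = warming_up + [101_000, 99_000, 100_500, 100_000, 99_500, 100_200]

print(reached_steady_state(warming_up))
print(reached_steady_state(settled))
```

Throughput can look flat while pending compactions are still growing, so the same check is worth applying to the compaction backlog before trusting the numbers.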
41.
LESSONS LEARNED – Amazon EBS &
Amazon S3
• Higher latency (vs. ephemeral) doesn’t mean lower throughput!
• Mitigate latency impact by increasing parallelism
• Major operational wins:
  • Much higher reliability (both storage and compute)
  • Decoupling state from compute allows each to be independently adjusted
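Why higher latency need not mean lower throughput follows from Little's law: throughput equals the number of requests in flight divided by the latency of each. The latencies below are hypothetical round numbers, not measurements of ephemeral or EBS storage, and the model ignores per-volume and per-instance limits.

```python
import math


def throughput_ops_per_sec(latency_ms, in_flight):
    """Little's law: throughput = requests in flight / latency."""
    return in_flight * 1000.0 / latency_ms


def in_flight_needed(target_ops_per_sec, latency_ms):
    """Concurrency required to sustain a target rate at a given latency."""
    return math.ceil(target_ops_per_sec * latency_ms / 1000.0)


EPHEMERAL_LATENCY_MS = 0.25  # hypothetical
EBS_LATENCY_MS = 1.0         # hypothetical

serial_ephemeral = throughput_ops_per_sec(EPHEMERAL_LATENCY_MS, in_flight=1)
serial_ebs = throughput_ops_per_sec(EBS_LATENCY_MS, in_flight=1)
parallelism_to_match = in_flight_needed(serial_ephemeral, EBS_LATENCY_MS)

print(serial_ephemeral, serial_ebs, parallelism_to_match)
```

With one request at a time the slower device delivers a quarter of the operations, but four requests in flight close the gap in this model.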
42.
LESSONS LEARNED – Amazon S3
Planning for a high Amazon S3 request rate:
• Add a random prefix to avoid hotspots:
  • s3://bucket/ApD4J. <object_name>
• If you have sufficient randomness, you’re not going to run into Amazon S3 limits…
  • We’ve had over 1 billion objects and 5 petabytes in a bucket
  • We made 1,600 API calls/sec against that bucket for 2 weeks on top of the regular production workload with zero impact
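One way to get a well-spread prefix is to derive it from a hash of the object name instead of drawing it at random, so the key can be recomputed later from the name alone. That is a swapped-in variant of the slide's "random prefix", following the guidance as given in this 2017 talk. The prefix length, the "." separator, and the object name are this sketch's choices.

```python
import hashlib


def prefixed_key(object_name, prefix_len=5):
    """Prepend a short hash-derived prefix so keys spread across the key space."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}.{object_name}"


key = prefixed_key("flows/2017-11-28/part-0001.parquet")  # hypothetical name
print(key)

# Volume of the two-week test on the slide: 1,600 calls/sec for 14 days.
extra_api_calls = 1600 * 14 * 24 * 3600
print(f"{extra_api_calls:,} extra API calls")
```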
43.
Thank you!