STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads with Amazon EBS

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
ProtectWise optimizes performance of
Cassandra and Kafka workloads with
Amazon EBS
Gene Stevens, CTO & Co-Founder, ProtectWise
Robert Tarrall, Director of DevOps, ProtectWise
Andrey Zaychikov, Sr. Solutions Architect, AWS
STG329
November 28, 2017

AGENDA
• Intro to NoSQL on AWS
• Intro to Apache Cassandra and Apache Kafka on AWS
• Best practices for Cassandra and Kafka deployments on AWS
• ProtectWise Use Case
• Use Case & Optimizations for Kafka
• Use Case & Optimizations for Cassandra
• Use Case & Optimizations for Amazon S3
• Lessons Learned

NoSQL as a technology

Database per Workload
Pentaho
Talend
Vertica
Aerospike
Cassandra
MongoDB

Database options on AWS

Complete set of data building blocks
• Storage: Amazon EFS, Amazon EBS, Amazon S3, Amazon Glacier
• Data movement (online and offline): AWS Snow family, AWS Storage Gateway family, AWS Direct Connect, Amazon EFS File Sync, Amazon S3 Transfer Acceleration, Storage Partners, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
• Data security and management: AWS KMS, AWS IAM, Amazon CloudWatch, AWS CloudTrail, AWS CloudFormation, AWS Lambda, Amazon Macie, Amazon QuickSight

What is Cassandra and why use it?
• Apache Cassandra is an open-source database based on the Dynamo model
• It is a massively scalable, geo-distributed, high-performance key-value database
• Common use cases for Cassandra include:
• Time-series data
• Social media
• User sessions (e.g., shopping carts)

How Cassandra works on a cluster level

How Cassandra works on a node level

Application interactions with Cassandra

Best practices for Cassandra on AWS

What is Apache Kafka and why use it?
• Apache Kafka is an open-source distributed
streaming platform
• It allows you to:
• Publish and subscribe to streams of records
• Store streams in a fault-tolerant way
• Process streams of records
• The most common use cases for Kafka are:
• Building data pipelines that capture and transfer
data between systems and applications
• Building real-time apps that react to
streams of data
• Kafka is often used to capture fast-arriving
data and buffer it in front of a database such as
Cassandra. Such a setup can reduce the
pressure on the database.

How Kafka works

Best practices for Kafka on AWS

Choosing proper instance and storage types
Always consider the database implementation, data schema, and access patterns.
Adapt compute and storage types to the particular situation; they can change
during the lifetime of the database.
Cassandra Kafka

CASE STUDY: PROTECTWISE
GENE STEVENS, CTO & CO-FOUNDER, PROTECTWISE
ROBERT TARRALL, DIRECTOR OF DEVOPS, PROTECTWISE

PROTECTWISE OVERVIEW

GOALS
• Very low end-to-end latency (~1 second)
• Very high availability
• Over 1 billion writes per hour
• High tolerance for bursts (10x-100x
normal volume)
• Trillions of records per year
• Less than 10-second response time to
searches
• Arbitrary queries: “all non-HTTP traffic on
port 80”
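To put the volume goals above in per-second terms, here is a small arithmetic sketch. It uses only the "1 billion writes per hour" and "10x-100x" figures from the slide; treating the load as evenly spread over the hour and the year is a simplifying assumption.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365

# Figure from the goals slide: "over 1 billion writes per hour".
writes_per_hour = 1_000_000_000

# Assumption: load is spread evenly, so these are averages, not peaks.
avg_writes_per_sec = writes_per_hour / SECONDS_PER_HOUR
records_per_year = writes_per_hour * HOURS_PER_YEAR

# Burst tolerance of 10x-100x normal volume, applied to the average rate.
burst_low = avg_writes_per_sec * 10
burst_high = avg_writes_per_sec * 100

print(f"average rate:     {avg_writes_per_sec:,.0f} writes/sec")
print(f"records per year: {records_per_year:,}")
print(f"burst range:      {burst_low:,.0f} to {burst_high:,.0f} writes/sec")
```

At this rate a year of data comes to several trillion records, which lines up with the "trillions of records per year" goal.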

INITIAL ARCHITECTURE

SOLUTION: Amazon S3, Kafka, Amazon EBS

KAFKA
Our Kafka clusters:
• 1,000 topics
• Up to 200 partitions per topic
• 45 c4.2xlarge
• 2x 1 TB gp2 EBS volumes
• Peak consumption > 100
MB/sec/server

KAFKA: WINS
• Retention: 24 hours of “buffer”
• Pub/sub with “at least once”
guarantee
• Fanout means we can test in
production:
• New engines publish to
“profiling” topic, confirm useful
detections
• Significant code changes can be
performance-tested

KAFKA: NOTES
• Partition is your fundamental unit of
scaling – use lots of partitions
• Use round robin partition
assignment, not range
• Be sure to test “edge” cases:
recovery times, backlog
• As broker recovers each
partition, consumers
rebalance
• “At least once” = “sometimes more
than once”
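The round robin recommendation above can be illustrated with a simplified sketch of the two assignment strategies. This is not Kafka's own code: it assumes every consumer subscribes to every topic, and the topic and consumer names are hypothetical. In a real consumer the strategy is selected with the `partition.assignment.strategy` setting.

```python
def range_assign(topics, consumers):
    """Per-topic assignment: split each topic's partitions into contiguous ranges."""
    members = sorted(consumers)
    assignment = {m: [] for m in members}
    for topic, num_partitions in sorted(topics.items()):
        per_consumer, extra = divmod(num_partitions, len(members))
        start = 0
        for i, member in enumerate(members):
            # The first `extra` consumers get one more partition, for every topic.
            count = per_consumer + (1 if i < extra else 0)
            assignment[member].extend((topic, p) for p in range(start, start + count))
            start += count
    return assignment


def round_robin_assign(topics, consumers):
    """Global assignment: deal all topic-partitions out one at a time."""
    members = sorted(consumers)
    assignment = {m: [] for m in members}
    all_partitions = [(t, p) for t, n in sorted(topics.items()) for p in range(n)]
    for i, topic_partition in enumerate(all_partitions):
        assignment[members[i % len(members)]].append(topic_partition)
    return assignment


# Hypothetical example: three topics with 3 partitions each, two consumers.
topics = {"alerts": 3, "events": 3, "flows": 3}
consumers = ["consumer-a", "consumer-b"]

range_counts = {m: len(tps) for m, tps in range_assign(topics, consumers).items()}
rr_counts = {m: len(tps) for m, tps in round_robin_assign(topics, consumers).items()}
print("range:      ", range_counts)
print("round robin:", rr_counts)
```

With range assignment the per-topic remainder always lands on the same consumers, so the imbalance grows with the number of topics; round robin spreads the remainder across the group.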

KAFKA: WARNINGS/CAVEATS
• Beware cross-AZ replication costs!
• Kafka has only limited “rack
awareness”
• Producers and consumers talk to
the “leader” of a partition
• With RF=2, data may cross AZs 3
(or more) times
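One way to read the "3 (or more) times" figure is as a worst-case hop count. The sketch below is an assumed model, not something stated on the slide: it supposes the producer, every follower replica, and every consumer group each sit in a different AZ than the partition leader.

```python
def worst_case_az_crossings(replication_factor, consumer_groups):
    """Count how many times one record may cross an AZ boundary (assumed model)."""
    producer_hop = 1                            # producer -> leader in another AZ
    replication_hops = replication_factor - 1   # leader -> each follower replica
    consumer_hops = consumer_groups             # leader -> each consuming group
    return producer_hop + replication_hops + consumer_hops


# RF=2 with a single consumer group gives the "3 times" case from the slide.
print(worst_case_az_crossings(replication_factor=2, consumer_groups=1))
# Each extra consumer group (fanout) can add one more crossing.
print(worst_case_az_crossings(replication_factor=2, consumer_groups=3))
```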

KAFKA – Cross-AZ Traffic

• Monthly costs in perspective:
• Instances: $8,000
• gp2 EBS volumes: $8,000
• Network traffic: $40,000

• Single broker failure impacts the
whole cluster
• “Let’s bump that timeout” often
has unexpected consequences
• Mostly trust default settings (but
use round-robin!)

DATABASE
• Sustain 250K writes/sec (bursts > 1 million/sec)
• 1 year of data
• Support arbitrary queries
• < 10-second response time for search

DATABASE (v1)
• We use DataStax Enterprise
Search
• Supports over 1 TB of data +
index on an i2.2xlarge
• Handled the load, but one
month of data = 100x
i2.2xlarge
• We keep a year of data…
• Sharded by time, migrated
older data to r4.2xlarge with
gp2 EBS volumes

DATABASE (v1) – Lessons Learned
• Use DSE 5.0 or later – much better indexing throughput
• Don’t use vnodes
• Do use large heap (20-30 GB is fine) and G1GC
• Beware outdated blog posts! (Amazon EBS has come a LONG
way)
• High write throughput + search leads to high operational burden
• Use Amazon EBS if at all possible

Amazon EBS!
Migrating data to Amazon EBS taught us that
we really want to use Amazon EBS:
• Instances without ephemeral storage fail less
frequently
• Amazon EBS volumes do fail, but very rarely
• Decoupling state from compute is a huge
win:
• Need more CPU in your Cassandra
cluster? Stop one AZ, change instance
type, start; repeat for all AZs
• Modify Amazon EBS volume to expand
storage
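The AZ-by-AZ procedure above can be written down as an ordered plan. The sketch below only builds the list of steps; the node names, AZ names, the `wait_healthy` step, and the target instance type are hypothetical, and actually executing the steps (via the AWS API or other tooling) is out of scope here.

```python
def rolling_resize_plan(nodes_by_az):
    """Return ordered (action, az, node) steps, handling one AZ at a time."""
    plan = []
    for az in sorted(nodes_by_az):
        nodes = nodes_by_az[az]
        # Data lives on EBS, so stopping an instance does not lose state.
        plan.extend(("stop", az, node) for node in nodes)
        plan.extend(("change_instance_type", az, node) for node in nodes)
        plan.extend(("start", az, node) for node in nodes)
        # Assumed safety step: let the AZ rejoin before touching the next one.
        plan.append(("wait_healthy", az, None))
    return plan


# Hypothetical cluster layout.
cluster = {
    "us-east-1a": ["cass-1", "cass-2"],
    "us-east-1b": ["cass-3"],
}
plan = rolling_resize_plan(cluster)
for step in plan:
    print(step)
```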

Amazon S3 for Full-text Search
Today’s architecture (write path):
• Data written to Cassandra with
a TTL
• Once final (a few hours), a
Spark job on Amazon EMR:
• Reads data from C*
• Writes Parquet files to
Amazon S3
• Writes Bloom filters to
Solr

Amazon S3 for Full-text Search
Read path:
• Very recent data
answered from C*/Solr
• Bloom filters tell us
which Parquet files to
fetch from Amazon S3 for
older data
• Spark on Amazon EMR
reads the Parquet files –
highly parallel
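A minimal sketch of how per-file Bloom filters can narrow a search to a few Parquet files. The slides do not describe the actual filter layout stored in Solr; the filter size, hash scheme, file names, and sample values below are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical Parquet files and the values they contain.
files = {
    "part-0001.parquet": ["10.0.0.1", "10.0.0.2"],
    "part-0002.parquet": ["10.0.0.3"],
    "part-0003.parquet": ["192.168.1.7", "10.0.0.2"],
}

# Write path: build one filter per file.
filters = {}
for name, values in files.items():
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    filters[name] = bloom


def candidate_files(value):
    """Read path: only files whose filter might contain the value are fetched."""
    return [name for name, bloom in filters.items() if bloom.might_contain(value)]


print(candidate_files("10.0.0.2"))
```

A false positive only costs one unnecessary file read, while a file that really contains the value is never skipped.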

LESSONS LEARNED - KAFKA
Overall, Kafka is a very valuable part of our platform and works great
on Amazon EBS.
If you expect massive scale, keep the following in mind:
• Cross-AZ replication cost adds up. 2 GB/sec for 1 month is 5
petabytes.
• A single broker can cause availability problems (not data loss) for
the whole cluster.
• Small clusters are very easy to operate; larger clusters have more
issues and higher mean time to recovery.
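The "2 GB/sec for 1 month is 5 petabytes" figure can be checked with a few lines of arithmetic. The sketch assumes a 30-day month and decimal units (1 PB = 1,000,000 GB).

```python
rate_gb_per_sec = 2                   # sustained cross-AZ rate from the slide
seconds_per_month = 30 * 24 * 3600    # assumption: 30-day month

total_gb = rate_gb_per_sec * seconds_per_month
total_pb = total_gb / 1_000_000       # assumption: decimal petabytes

print(f"{total_gb:,} GB per month = {total_pb:.2f} PB")
```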

LESSONS LEARNED - CASSANDRA
• Cassandra and Amazon EBS have
come a long way in a short time
• Ignore most of what was written
before 2016!
• 2014: “unless you want to
add more complexity for
your operations team…
choose ephemeral”
(DataStax blog)
• Today: “gp2 volumes… best
choice for most workloads”

LESSONS LEARNED - CASSANDRA
• Cassandra is REALLY good at handling bursts
• Take the time to run benchmarks matching your expected
workload:
• Run long enough to reach “steady state” (hours to days)
• Object sizes, read/write ratios, key distribution
• Compaction strategy
• Watch for:
• Pending compactions
• Blocked native transport requests

LESSONS LEARNED – Amazon EBS &
Amazon S3
• Higher latency (vs. ephemeral) doesn’t
mean lower throughput!
• Mitigate latency impact by increasing
parallelism
• Major operational wins:
• Much higher reliability (both
storage and compute)
• Decoupling state from compute
allows each to be independently
adjusted
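The point that higher latency need not mean lower throughput follows from Little's law: throughput equals the number of requests in flight divided by the latency of each request. The sketch below uses hypothetical latencies (200 µs for ephemeral storage, 1,000 µs for Amazon EBS) purely for illustration; they are not measurements from the talk.

```python
MICROS_PER_SEC = 1_000_000


def throughput_ops_per_sec(in_flight, latency_us):
    """Little's law: throughput = concurrency / latency."""
    return in_flight * MICROS_PER_SEC / latency_us


def in_flight_needed(target_ops_per_sec, latency_us):
    """Concurrency required to sustain a target rate (rounded up)."""
    return -(-target_ops_per_sec * latency_us // MICROS_PER_SEC)


target = 20_000              # hypothetical target I/O rate
ephemeral_latency_us = 200   # hypothetical latency figure
ebs_latency_us = 1_000       # hypothetical latency figure

ephemeral_parallelism = in_flight_needed(target, ephemeral_latency_us)
ebs_parallelism = in_flight_needed(target, ebs_latency_us)

print("ephemeral needs", ephemeral_parallelism, "requests in flight")
print("EBS needs      ", ebs_parallelism, "requests in flight")
print("EBS throughput at that parallelism:",
      throughput_ops_per_sec(ebs_parallelism, ebs_latency_us), "ops/sec")
```

Under these assumed numbers, five times the latency is offset by five times the parallelism, and the target rate is unchanged.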

LESSONS LEARNED – Amazon S3
Planning for high Amazon S3 request rate:
• Add random prefix to avoid hotspots:
• s3://bucket/ApD4J. <object_name>
• If you have sufficient randomness, you’re
not going to run into Amazon S3 limits…
• We’ve had over 1 billion objects and
5 petabytes in a bucket
• We made 1600 API calls/sec against
that bucket for 2 weeks on top of
regular production workload with
zero impact
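A minimal sketch of the prefix idea above. The slide does not say how the prefix is generated; deriving it from a hash of the object name is an assumption made here so that the key can be recomputed later without a lookup table. The function name and the sample object name are hypothetical.

```python
import hashlib


def prefixed_key(object_name, prefix_len=5):
    """Prepend a short, evenly distributed prefix derived from the object name."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}.{object_name}"


# Hypothetical object name.
key = prefixed_key("flows-2017-11-28-0001.parquet")
print(f"s3://bucket/{key}")
```

Because the prefix is deterministic, a reader that knows the object name can rebuild the full key with the same function.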

Thank you!
