Nisum - Global Big Data Conference - Advance Cassandra by Faraz Mohammed
1. UNITED STATES CHILE INDIA NISUM.COM P. 1
What is “big” in big data?
……Cassandra
Faraz Mohammed
VP @ Nisum
• Innovation Lab – time-boxed, fixed-cost co-research with clients on complex problems
• IT Consulting/Implementation
June 9, 2017
Global Software Architecture Conference
Simplifying the adoption of complex, cutting-edge technologies, backed
by deep research and understanding.
Who are we?
Agenda
– What is “big” in big data?
– Something interesting happening
– Cassandra
Does data size really matter today?
Big data: breakthrough in processing large amounts of data
RDBMS: large data, yet RDBMS; the struggle of converting OLTP to OLAP
Technology explosion: too many options, complex choices
DBMS: small data
5 years ago
Data was big here … and here … but not here
Example….RDBMS vs Big Data Tech
Today we can handle large data… we just need to choose the right technology.
Something interesting happening
Heavy downloads
Negligible uploads
Heavy downloads
Heavy uploads
The internet is turning upside down,
or, to be precise, downside up
Product Digitalization: data will keep
growing
Google Cars - ~2 PB per year per car
Our Observation
Data is growing significantly, and it is not going to slow down.
Even so, the present-day challenge is not the volume or
variety of data, but rather the overload of
“technologies”.
Cassandra
– Continuous Availability
– Linear Scalability
– No single point of failure
– Spans multiple DCs
– Powerful Dynamic Data Model
• Maximum Flexibility
• Fast response
• 2 billion columns per row
– Open Source
– NoSQL
– 3.10
– Java
– Walmart
– Facebook
– Twitter
– Netflix
Operational Complexities
Careful Cassandra
Teams often misunderstand the use case for Cassandra and
use it as a general-purpose DB. It’s a great tool and we like it,
but too often we see teams run into trouble using it.
Require joins or complex search? Say no to Cassandra.
Predefined indexes/keys? Yeah … Cassandra.
Cassandra Careful - Lessons
• Data modeling is not simple: We saw cases where engineers remodeled entire
databases multiple times to meet changing business needs.
• Not a general-purpose database: It is optimized for fast reads on large data sets
based on predefined keys or indexes.
• Time series: Suitable for storing time series data or metrics.
• Requires processing at retrieval? If your use case requires complex filtering or
processing when retrieving data, then Cassandra may not be the right choice for
you.
• Not row-level consistent: Data integrity challenges for non-key columns.
• Operational complexities: Requires careful planning and consideration.
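The difference between reading by a predefined key and filtering at retrieval time can be illustrated with a toy in-memory model. This is only a sketch and not Cassandra's API: the class `PartitionedTable`, its method names, and the sample sensor rows are all hypothetical, and a plain Python dict stands in for the partition index.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy stand-in for a table whose rows are grouped by a predefined partition key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, row):
        self._partitions[partition_key].append(row)

    def read_by_key(self, partition_key):
        # One lookup on the predefined key; no other partition is touched.
        return list(self._partitions.get(partition_key, []))

    def scan_filter(self, predicate):
        # Filtering on a non-key attribute has to visit every partition.
        return [
            row
            for rows in self._partitions.values()
            for row in rows
            if predicate(row)
        ]


# Hypothetical time-series rows keyed by (sensor id, day).
table = PartitionedTable()
table.insert(("sensor-1", "2017-06-09"), {"ts": "10:00", "temp": 21.5})
table.insert(("sensor-1", "2017-06-09"), {"ts": "10:05", "temp": 22.0})
table.insert(("sensor-2", "2017-06-09"), {"ts": "10:00", "temp": 30.1})

by_key = table.read_by_key(("sensor-1", "2017-06-09"))
hot = table.scan_filter(lambda row: row["temp"] > 25)
```

The key-based read touches a single partition, while the filter walks all of them. That is the kind of access pattern the lessons above warn about.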
Design Considerations – Success Factors
• In-depth understanding of the “underlying architecture”
• Infrastructure awareness
• Proactive “capacity planning”
are key to success….
Cassandra Underlying Architecture
AWS – Regions and Zones
Design Considerations
CPU
• Cassandra is highly concurrent and uses as many CPU cores as available
• Insert-heavy use cases are CPU bound
• AWS - at least 4 vCPUs
• AWS - choose compute-optimized instance types for heavy inserts
Memory
• Runs on the JVM – size the heap properly, avoid heaps that are too large
• MAX_HEAP_SIZE not more than 8 GB
• HEAP_NEWSIZE, 100 MB per vCPU
• Leave enough memory for the OS file cache
• AWS - 32 GB RAM
Storage
• I/O is mostly sequential, but some random I/O is required
• SSD preferred – low latency for random reads, and high performance for sequential writes for compactions
• Storage requirements - account for storage overhead for compaction
• Adopt the XFS or Ext4 file system… avoid Ext3
Network
• Gossip/Replication – heavy traffic. At least 1 Gbps bandwidth
• Spread across Regions & Zones, i.e. DCs and racks. SNITCH settings
• AWS - choose enhanced networking
• VPC – private subnets = replication factor. IP scheme
• AWS - use ENI for seeds, and spread seeds across zones
SUMMARY
• DATA IS NOT BIG, BUT THE CHALLENGE IS WITH TECHNOLOGY CHOICE OVERLOAD
• DATA WILL KEEP GROWING, AS THE INTERNET IS TURNING UPSIDE DOWN
• CONSIDER LAMBDA ARCHITECTURE – IT CATERS TO MANY USE CASES
• CASSANDRA CAREFUL – IT IS NOT FOR EVERYONE
@nisumtech
Faraz Mohammed
VP, INNOVATION & PRODUCT
714-204-7712
mfaraz@nisum.com
THANK YOU
www.nisum.com
500 S. Kraemer Boulevard, Suite 301, Brea, CA 92821
Building SuccessTogether®
@Captain_Faraz
We’re hiring….
Editor's Notes
2008
ENI – Elastic Network Interfaces
Storage
Most of the I/O happening in Cassandra is sequential, but there are cases where random I/O is required … an example is reading SSTables during read operations.
SSD is the recommended storage … as it provides extremely low-latency response times for random read operations, while supporting ample sequential write performance for compaction operations.
Replication and storage overhead due to compaction … should be taken into account while determining storage requirements.
The recommended file system for all volumes is XFS. Ext4 can be used, but avoid Ext3, as it is considerably slower.
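The storage note above (account for replication and compaction overhead) can be turned into rough arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions: the function name `required_disk_per_node_gb` is hypothetical, and the default `compaction_headroom` of 0.5 (keep half of each disk free for compaction) is an assumed planning figure, not a number from the talk.

```python
def required_disk_per_node_gb(raw_data_gb, replication_factor, nodes,
                              compaction_headroom=0.5):
    """Estimate the disk needed per node.

    compaction_headroom is the fraction of each disk kept free for
    compaction; 0.5 is an assumption and should be tuned per workload.
    """
    if not 0 <= compaction_headroom < 1:
        raise ValueError("compaction_headroom must be in [0, 1)")
    replicated_gb = raw_data_gb * replication_factor   # every replica is stored
    per_node_gb = replicated_gb / nodes                 # assumes even distribution
    return per_node_gb / (1 - compaction_headroom)      # leave room for compaction


# Example: 1 TB of raw data, replication factor 3, spread over 6 nodes.
estimate = required_disk_per_node_gb(1000, 3, 6)
```

With these illustrative inputs, each node holds 500 GB of replicated data and needs about 1000 GB of disk once the assumed headroom is included.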
Networking
Cassandra uses the Gossip protocol to exchange information with other nodes about network topology…. Reads and writes involve talking to multiple nodes … which results in a lot of data transfer through the network.
We recommend always choosing instances with at least 1 Gbps of network bandwidth … to accommodate replication and Gossip.
If you use AWS, choose instances with enhanced networking enabled … for better performance.
Be sure to use a VPC, keep nodes in private subnets, and create as many subnets as the replication factor. Use NAT for translation.
Another thing to account for while planning subnets for your Cassandra cluster is that Amazon reserves the first four IP addresses and the last IP address of every subnet for IP networking purposes.
Use Elastic Network Interfaces (ENI). An ENI is a virtual network interface that can be used for managing seed servers.
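The effect of the reserved addresses on subnet planning is easy to check with Python's standard `ipaddress` module. This is a small sketch: the helper name `usable_aws_addresses` and the `10.0.0.0` ranges are illustrative assumptions, while the count of five reserved addresses follows the note above (first four plus the last).

```python
import ipaddress

# First four addresses plus the last one in every subnet, per the note above.
AWS_RESERVED_PER_SUBNET = 5


def usable_aws_addresses(cidr):
    """Return how many addresses remain for instances in a subnet of this size."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


small = usable_aws_addresses("10.0.0.0/28")   # 16 addresses in total
large = usable_aws_addresses("10.0.0.0/16")   # 65536 addresses in total
```

A /28 leaves only 11 addresses for nodes, which is why a larger range is the safer choice for a cluster that is expected to grow.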
Memory
Cassandra primarily runs on the JVM.
The JVM has to be appropriately sized for performance.
Large heaps can introduce garbage collection (GC) pauses that can lead to latency or even make a Cassandra node appear to have gone offline.
Proper heap settings can minimize the impact of GC in the JVM.
The MAX_HEAP_SIZE parameter determines the heap size of the Cassandra JVM. DataStax recommends not allocating more than 8 GB for the heap.
The HEAP_NEWSIZE parameter is the size of the young generation in Java. A general rule of thumb is to set this value at 100 MB per vCPU.
Cassandra also largely depends on the OS file cache for read performance. Hence choosing an optimum JVM heap size and leaving enough memory for the OS file cache is important…. For production workloads we recommend going with at least 32 GB of DRAM.
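The heap rules of thumb above can be written down as a small calculation. This is a sketch under assumptions: the function name `suggest_heap_settings` is hypothetical, and sizing the heap at one quarter of RAM before applying the 8 GB cap is an assumed starting point rather than something stated in the notes.

```python
MAX_HEAP_CAP_MB = 8 * 1024      # "not more than 8 GB" from the notes above
NEW_GEN_MB_PER_VCPU = 100       # "100 MB per vCPU" rule of thumb


def suggest_heap_settings(ram_mb, vcpus):
    """Return (max_heap_mb, young_gen_mb) for a node.

    Assumption: start from a quarter of RAM so the OS file cache keeps
    most of the memory, then apply the 8 GB cap.
    """
    max_heap_mb = min(MAX_HEAP_CAP_MB, ram_mb // 4)
    young_gen_mb = NEW_GEN_MB_PER_VCPU * vcpus
    return max_heap_mb, young_gen_mb


# Example: a node with 32 GB of RAM and 4 vCPUs.
heap_mb, young_mb = suggest_heap_settings(32 * 1024, 4)
```

For the 32 GB, 4 vCPU example this gives an 8192 MB heap and a 400 MB young generation, leaving roughly 24 GB for the OS file cache. Treat the result as a starting point and validate it against GC behavior under a representative workload.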
CPU
Insert-heavy workloads are CPU-bound in Cassandra before becoming I/O-bound. In other words, all write operations go to the commit log, but Cassandra is so efficient at writing that the CPU becomes the limiting factor. Cassandra is highly concurrent and uses as many CPU cores as available.
We recommend at least 4 vCPUs … test before you settle.
Others
AWS – choose memory-optimized or storage-optimized instance types.
Test a representative workload… before choosing the final instance types.
Spread across AZs … so in case of disaster you can still ensure availability and uptime.
Application
When building a Cassandra cluster, select the same region for your data and application … to minimize application latency.
A Cassandra cluster can be made Amazon EC2 aware … and thus support high availability, by defining appropriate snitch settings. This allows Cassandra to place the replicas for data partitions on nodes that are in different AZs.
Spread your seed nodes across multiple availability zones … seed nodes help bootstrap new nodes.
Cassandra nodes can be datacenter aware or rack aware. Datacenter = AWS Region, and Rack = Availability Zone.
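Once data centers map to regions and racks to zones, replica placement is typically expressed per data center when a keyspace is created. The sketch below only builds the CQL text. It assumes the `NetworkTopologyStrategy` replication class; the helper `create_keyspace_cql`, the keyspace name `metrics`, and the data center names `us-east` and `us-west` are hypothetical and must match whatever names the configured snitch actually reports.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with one replica count per data center."""
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical names: three replicas in each of two regions.
statement = create_keyspace_cql("metrics", {"us-east": 3, "us-west": 3})
```

With a rack-aware snitch, the three replicas in each data center can then be placed on nodes in different zones, which is the availability benefit described above.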
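Replication across regions could also serve as a backup.
Launch Cassandra in a VPC … it supports the enhanced networking feature … which means low latency.
For example, go with a /16 network instead of a /28, as the latter has only 16 addresses (14 conventionally usable, and only 11 in an AWS subnet once the five reserved addresses are excluded).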
AWS
Regions are independent, but AZs are connected by low-latency links.
Communication between regions goes over the public internet … so ensure encryption.
Also, there is a charge for data transfer between regions … it can be a nasty surprise.
…no fee between transfers – AWS Kinesis
…one way fee, not two way – S3 across regions
… two-way cost, IN and OUT – transferring across EC2 instances in different AZs