A Comprehensive Introduction to
Apache Cassandra
Saeid Zebardast
@saeidzeb
zebardast.com
Feb 2015
Agenda
● What is NoSQL?
● What is Cassandra?
● Architecture
● Data Model
● Key Features and Benefits
● Hardware
● Directories and Files
● Cassandra Tools
○ CQL
○ Nodetool
○ DataStax OpsCenter
● Backup and Restore
● Who’s using Cassandra?
What is NoSQL?
● NoSQL (Not Only SQL)
● Simplicity of Design
● Horizontal Scaling (Scale Out)
○ Add as many nodes to the cluster as you need
○ Not every NoSQL database scales out this way
● Finer Control over availability
● Data Structure
○ Key-Value
○ Column-Oriented
○ Graph
○ Document-Oriented
○ And others
What is Cassandra?
● Since 2008 - Current stable version 2.1.2 (Nov 2014)
● NoSQL
● Distributed
● Open source
● Written in Java
● High performance
● Extremely scalable
● Fault tolerant (i.e., no single point of failure, SPOF)
Architecture Highlights
● Scale out, not up
● Peer-to-Peer, distributed system
○ All nodes the same - masterless with no SPOF
● Online load balancing, cluster growth
● Understanding System/Hardware failures
● Custom data replication to ensure fault tolerance
● CAP theorem (Consistency, Availability, Partition tolerance)
○ You cannot have all three at the same time
○ The tradeoff between consistency and latency is tunable
○ Strong Consistency = Increased Latency
● Nodes communicate with each other
○ through the Gossip protocol
Architecture Layers
● Core Layer
○ Messaging service
○ Gossip failure detection
○ Cluster state
○ Partitioner
○ Replication
● Middle Layer
○ Commit log
○ Memtable
○ SSTable
○ Indexes
○ Compaction
● Top Layer
○ Tombstones
○ Hinted handoff
○ Read repair
○ Bootstrap
○ Monitoring
○ Admin tools
Architecture of a write
1. A write is first appended to the commit log on disk (sequential I/O).
2. After it is written to the commit log, it is sent to the appropriate nodes.
3. Each node that receives the write first records it in a local log, then updates the appropriate Memtables (one for each column family).
○ A Memtable is the in-memory representation of data (before the data is flushed to disk as an SSTable).
○ Memtables are flushed to disk when:
■ They run out of space
■ They hold too many keys (128 is the default)
■ A time duration elapses (client provided; there is no cluster clock)
4. When a Memtable is written out, two files are produced:
○ Data file (SSTable)
○ Index file (SSTable index)
5. When all the column families referenced in a commit log have been flushed to disk, the commit log is deleted.
6. Compaction
○ Periodically, data files are merge-sorted into a new file.
○ Merge keys
○ Combine columns
○ Discard tombstones
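The write path above can be sketched as a toy, single-node storage engine. This is a simplified illustration in Python, not Cassandra's implementation: the `Node` class, its `flush_threshold` parameter, and the `TOMBSTONE` marker are hypothetical names chosen for the example, and compaction here drops tombstones immediately, whereas a real cluster keeps them for a grace period.

```python
from dataclasses import dataclass, field

TOMBSTONE = object()  # marker written in place of a deleted value


@dataclass
class Node:
    """Toy storage engine: commit log -> memtable -> SSTables -> compaction."""
    flush_threshold: int = 3  # hypothetical limit; real flush triggers differ
    commit_log: list = field(default_factory=list)
    memtable: dict = field(default_factory=dict)
    sstables: list = field(default_factory=list)  # oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # sequential append for durability
        self.memtable[key] = value            # in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is a write of a tombstone

    def flush(self):
        # An SSTable is immutable and sorted by key.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def compact(self):
        merged = {}
        for table in self.sstables:  # newer tables overwrite older ones
            merged.update(table)
        live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.sstables = [live]

    def read(self, key):
        if key in self.memtable:
            value = self.memtable[key]
        else:
            value = None
            for table in reversed(self.sstables):  # newest first
                if key in table:
                    value = table[key]
                    break
        return None if value is TOMBSTONE else value
```

Reads check the memtable first and then the SSTables from newest to oldest, which is why compaction (fewer files to check) tends to help read latency.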
Data Model
● [Keyspace][ColumnFamily][Key][Column]
● A keyspace is akin to a database in RDBMS
● A keyspace holds its data in a row-oriented, column-based structure
● A column family is similar to an RDBMS table
○ More flexible/dynamic
● A row in a column family is indexed by its key (Primary Key).
○ Cassandra supports up to 2 billion columns per (physical) row.
● Sample code to create a keyspace and a column family:
○ CREATE KEYSPACE logs WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
○ CREATE TABLE logs.samples (
    node_id text,
    metric text,
    collection_ts timestamp,
    value bigint,
    PRIMARY KEY ((node_id, metric), collection_ts)
  ) WITH CLUSTERING ORDER BY (collection_ts DESC);
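The [Keyspace][ColumnFamily][Key][Column] addressing can be pictured as nested maps. The Python sketch below is only a mental model under that assumption: `store`, `insert`, and `select_partition` are hypothetical helpers, and the composite partition key (node_id, metric) and clustering column collection_ts mirror the sample table.

```python
from collections import defaultdict

# keyspace -> table -> partition key -> {clustering key: columns}
store = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def insert(keyspace, table, partition_key, clustering_key, columns):
    """Upsert one row; the same primary key overwrites the previous row."""
    store[keyspace][table][partition_key][clustering_key] = columns


def select_partition(keyspace, table, partition_key, descending=True):
    """Return (clustering key, columns) pairs of one partition, in order."""
    rows = store[keyspace][table][partition_key]
    return sorted(rows.items(), key=lambda item: item[0], reverse=descending)


# Usage: two samples for the same (node_id, metric) partition.
insert("logs", "samples", ("node1", "cpu"), 100, {"value": 5})
insert("logs", "samples", ("node1", "cpu"), 200, {"value": 7})
latest_first = select_partition("logs", "samples", ("node1", "cpu"))
```

With descending clustering order, the most recent sample comes first, which is the access pattern the CLUSTERING ORDER BY clause is meant to make cheap.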
Data Model - Primary Keys
● Primary Keys are unique.
● Single Primary Key
○ PRIMARY KEY(keyColumn)
● Composite Primary Key
○ PRIMARY KEY (myPartitionKey, my1stClusteringKey, my2ndClusteringKey)
● Composite Partitioning Key
○ PRIMARY KEY ((my1stPartitionKey, my2ndPartitionKey), myClusteringKey)
Data Model - Time-To-Live (TTL)
● TTL a row
○ INSERT INTO users (id, first, last) VALUES ('abc123', 'saeid', 'zeb')
USING TTL 3600; // expires the data in one hour
● TTL a column
○ UPDATE users USING TTL 30 SET last = 'zebardast' WHERE id = 'abc123';
● TTL is in seconds
● A default TTL can also be set at the table level.
● Expired columns/rows are automatically deleted.
● With no TTL specified, columns/values never expire.
● TTL is useful for automatic deletion.
● Re-inserting the same row before it expires overwrites the TTL.
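The TTL rules above can be illustrated with a small in-memory model. This is a sketch of the semantics only, assuming a per-cell expiry time; the `TtlStore` class and its injectable `clock` argument are hypothetical and exist so the example can be exercised without waiting.

```python
import time


class TtlStore:
    """Toy cell store where each (row, column) cell may carry its own TTL."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._cells = {}  # (row id, column) -> (value, expiry time or None)

    def put(self, row_id, column, value, ttl=None):
        # Writing a cell again replaces both its value and its TTL.
        expires_at = None if ttl is None else self._clock() + ttl
        self._cells[(row_id, column)] = (value, expires_at)

    def get(self, row_id, column):
        cell = self._cells.get((row_id, column))
        if cell is None:
            return None
        value, expires_at = cell
        if expires_at is not None and self._clock() >= expires_at:
            del self._cells[(row_id, column)]  # expired cells disappear
            return None
        return value
```

A cell written without a TTL never expires, and rewriting a cell before it expires resets its expiry, matching the last two bullets.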
Partitioners - Consistent hashing
● A partitioner determines how data is distributed across the nodes in the cluster (including replicas).
● A partitioner is a function for deriving a token representing a row from its partition key (typically by hashing).
Example data:
name      email                 gender
Saeid     saeid@domain.com      M
Kamyar    kamyar@domain.com     M
Nazanin   nazanin@domain.com    F
Masoud    masoud@domain.com     M

Cassandra assigns a hash value to each partition key:
partition key   Murmur3 hash value
Saeid           -2245462676723223822
Kamyar          7723358927203680754
Nazanin         -6723372854036780875
Masoud          1168604627387940318

Cassandra places the data on each node according to the value of the partition key and the range that the node is responsible for:
Node   Start range            End range              Partition key   Hash value
A      -9223372036854775808   -4611686018427387905   Nazanin         -6723372854036780875
B      -4611686018427387904   -1                     Saeid           -2245462676723223822
C      0                      4611686018427387903    Masoud          1168604627387940318
D      4611686018427387904    9223372036854775807    Kamyar          7723358927203680754
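Given a token, finding the owning node is a range lookup. The Python sketch below assumes the four-node layout from the example, where each node's range starts at -2^63, -2^62, 0, and 2^62 respectively. The tokens are copied from the hash table above rather than recomputed, and `node_for_token` is a hypothetical helper, not a Cassandra API.

```python
from bisect import bisect_right

# Lower bound of each node's token range, taken from the example layout.
RANGE_STARTS = [-2**63, -2**62, 0, 2**62]
NODES = ["A", "B", "C", "D"]

# Murmur3 tokens copied from the example; they are not recomputed here.
TOKENS = {
    "Saeid": -2245462676723223822,
    "Kamyar": 7723358927203680754,
    "Nazanin": -6723372854036780875,
    "Masoud": 1168604627387940318,
}


def node_for_token(token):
    """Return the node whose range contains the given 64-bit signed token."""
    if not -2**63 <= token < 2**63:
        raise ValueError("token outside the 64-bit signed range")
    # The owner is the last range whose start is <= token.
    return NODES[bisect_right(RANGE_STARTS, token) - 1]


placement = {name: node_for_token(token) for name, token in TOKENS.items()}
```

Because the lookup depends only on the token, any node can work out where a row lives without consulting a central directory.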
Key Features and Benefits
● Gigabyte to Petabyte scalability
● Linear performance
● No SPOF
● Easy replication / data distribution
● Multi datacenter and cloud capable
● No need for separate caching layer
● Tunable data consistency
● Flexible schema design
● Data compression
● CQL Language (like SQL)
● Support for key languages and platforms
● No need for special hardware or software
Big Data Scalability
● Capable of comfortably scaling to petabytes
● New nodes = linear performance increase
● Add new nodes online
No Single Point of Failure
● All nodes the same
○ Peer-to-Peer - masterless
● Customized replication affords tunable data redundancy
● Read/Write from any node
● Can replicate data among different physical data center racks
Easy Replication / Data Distribution
● Transparently handled by Cassandra
● Multi-data center capable
● Exploits all the benefits of Cloud computing
● Able to do Hybrid Cloud/On-Premise setup
No Need for Caching Software
● Peer-to-Peer architecture
○ removes need for special caching layer
● The database cluster uses the memory of all participating nodes to cache the data assigned to each node.
● No inconsistencies arise between a memory cache and the database
Tunable Data Consistency
● Choose between strong and eventual consistency
○ Depends on the need
● Can be set on a per-operation basis, for both reads and writes.
● Handles multi-data-center operations
● Consistency Level (CL)
○ ALL = all replicas ack
○ QUORUM = a majority of replicas ack (floor(RF / 2) + 1)
○ ONE = only one replica ack
○ Plus more… (see docs)
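The arithmetic behind these levels is small enough to write down. The Python sketch below covers only the three levels listed above; the function names are hypothetical. It uses the standard rule that reads see the latest write when read acks plus write acks exceed the replication factor.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level, replication_factor):
    """Number of replica acknowledgements a consistency level waits for."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def is_strongly_consistent(read_level, write_level, replication_factor):
    """True when every read overlaps the latest write: R + W > RF."""
    reads = required_acks(read_level, replication_factor)
    writes = required_acks(write_level, replication_factor)
    return reads + writes > replication_factor
```

For example, with RF = 3, QUORUM reads and QUORUM writes each wait for 2 replicas, so they always overlap on at least one replica, while ONE/ONE does not guarantee that.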
Flexible Schema
● Dynamic schema design
● Handles structured, semi-structured, and unstructured data.
● Counters are supported
● No offline/downtime for schema changes
● Supports primary and secondary indexes
○ Secondary indexes != Relational Indexes (they are for convenience, not speed)
Data Compression
● Uses Google’s Snappy data compression algorithm
● Compresses data on a per column family level
● Internal tests at DataStax show up to 80%+ compression on row data
● No performance penalty
○ Some increases in overall performance due to less physical I/O
Locally Distributed
● Client reads or writes to any node
● Node coordinates with others
● Data read or replicated in parallel
● Replication info
○ Replication Factor (RF): how many copies of your data are stored
○ Each node stores (RF / cluster size) × 100% of the cluster's total data.
○ Handy Calculator: http://www.ecyrd.com/cassandracalculator/
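The ownership rule is a one-line calculation. The Python sketch below assumes tokens are spread evenly across nodes; `ownership_fraction` and `data_per_node_gb` are hypothetical helper names.

```python
def ownership_fraction(replication_factor, cluster_size):
    """Fraction of the cluster's unique data that each node stores."""
    if replication_factor < 1 or cluster_size < 1:
        raise ValueError("replication factor and cluster size must be >= 1")
    # A node cannot hold more than one copy of everything.
    return min(1.0, replication_factor / cluster_size)


def data_per_node_gb(unique_data_gb, replication_factor, cluster_size):
    """Approximate disk usage per node for a given amount of unique data."""
    return unique_data_gb * ownership_fraction(replication_factor, cluster_size)
```

With RF = 3 on a 6-node cluster, each node holds half of the data, so 1000 GB of unique data needs roughly 500 GB per node before compaction and snapshot overhead.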
Rack Aware
● Cassandra is aware of which rack (or availability zone) each node resides in.
● It will attempt to place each data copy in a different rack.
Data Center Aware
● Active Everywhere - reads/writes in multiple data centers
● Client writes local
● Data syncs across WAN
● Replication Factor per DC
● Different number of nodes per data center
Node Failure
● A single node failure shouldn’t bring the cluster down.
● Replication Factor + Consistency Level = Success
Node Recovery
● When a write is performed and a replica node for the row is unavailable, the coordinator stores a hint locally.
● When the node recovers, the coordinator replays the missed writes.
● Note: a hinted write does not count towards the consistency level.
● Note: you should still run repairs across your cluster.
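Hinted handoff can be sketched with a coordinator that remembers writes for replicas that are down. This is a simplified Python model, not Cassandra's implementation: `Replica`, `Coordinator`, and `replay_hints` are hypothetical names, and real hints expire after a configurable window, which is one reason repairs are still needed.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (replica, key, value) for replicas that were down

    def write(self, key, value, required_acks):
        live = [r for r in self.replicas if r.up]
        if len(live) < required_acks:
            # Hints do not count towards the consistency level.
            raise RuntimeError("not enough live replicas")
        for replica in self.replicas:
            if replica.up:
                replica.apply(key, value)
            else:
                self.hints.append((replica, key, value))
        return len(live)

    def replay_hints(self):
        """Deliver stored hints to replicas that have come back up."""
        remaining = []
        for replica, key, value in self.hints:
            if replica.up:
                replica.apply(key, value)
            else:
                remaining.append((replica, key, value))
        self.hints = remaining
```

The write succeeds as long as enough live replicas acknowledge it, and the recovered node catches up when the hints are replayed.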
Security in Cassandra
● Internal Authentication
○ Manages login IDs and passwords inside the database.
● Object Permission Management
○ Controls who has access to what and who can do what in the database
○ Uses familiar GRANT/REVOKE from relational systems.
● Client to Node Encryption
○ Protects data in flight to and from a database
Hardware
● RAM
○ The more memory a Cassandra node has, the better read performance.
■ For dedicated hardware, the optimal price-performance sweet spot is 16GB to 64GB; the minimum is 8GB.
■ For virtual environments, the optimal range may be 8GB to 16GB; the minimum is 4GB.
● CPU
○ More cores are better. Cassandra is built with concurrency in mind.
■ For dedicated hardware, 8-core CPU processors are the current price-performance sweet spot.
■ For virtual environments, consider using a provider that allows CPU bursting, such as Rackspace.
● Disk
○ Cassandra tries to minimize random IO. Use a minimum of 2 disks. Keep the CommitLog and Data (SSTable) on separate spindles. RAID10 or RAID0 as you see fit.
○ XFS or ext4.
● Network
○ Be sure that your network can handle traffic between nodes without bottlenecks.
■ Recommended bandwidth is 1000 Mbit/s (gigabit) or greater.
● More info: Selecting hardware for enterprise implementations...
Directories and Files
● Configs
○ The main configuration file for Cassandra
■ /etc/cassandra/cassandra.yaml
○ Java Virtual Machine (JVM) configuration settings
■ /etc/cassandra/cassandra-env.sh
● Data directories
○ /var/lib/cassandra
● Log directory
○ /var/log/cassandra
● Environment settings
○ /usr/share/cassandra
● Cassandra user limits
○ /etc/security/limits.d/cassandra.conf
● More info: Package installation directories...
CQL Language
● Very similar to RDBMS SQL syntax
● Create objects via DDL (e.g. CREATE)
● Core DML commands supported: INSERT, UPDATE, DELETE
● Query data with SELECT
● cqlsh, the Python-based command-line client
○ CASSANDRA_PATH/bin/cqlsh
● More info: https://cassandra.apache.org/doc/cql/CQL.html
Nodetool
● A command line interface for managing a cluster.
○ CASSANDRA_PATH/bin/nodetool
● Useful commands:
○ nodetool info - Display node info (uptime, load, etc.).
○ nodetool status [keyspace] - Display cluster info (state, load, etc.).
○ nodetool cfstats [keyspace] - Display statistics of column families.
○ nodetool tpstats - Display usage statistics of thread pool.
○ nodetool netstats - Display network information.
○ nodetool repair - Repair one or more column families.
○ nodetool rebuild - Rebuild data by streaming from other nodes (similar to bootstrap).
○ nodetool drain - Flush Memtables to SSTables on disk and stop accepting writes. Useful before a restart to make startup quick.
○ nodetool flush [keyspace [columnfamily]] - Flushes one or more column families from the memtable.
○ nodetool cfhistograms keyspace columnfamily - Display statistic histograms for a given column family.
○ nodetool proxyhistograms - Display statistic histograms for network operations.
○ nodetool help - Display help information!
Backup and Restore
● Take Snapshot
○ nodetool snapshot
■ /var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/snapshot_name
○ nodetool clearsnapshot
● Restore Procedure
○ Shut down the node.
○ Clear all files in the commitlog directory (/var/lib/cassandra/commitlog).
○ Delete all *.db files in the data_directory_location/keyspace_name/table_name-UUID directory.
○ Locate the most recent snapshot folder in this directory:
■ data_directory_location/keyspace_name/table_name-UUID/snapshots/snapshot_name
○ Copy its contents into this directory:
■ data_directory_location/keyspace_name/table_name-UUID
○ Start the node
■ Restarting causes a temporary burst of I/O activity and consumes a large amount of CPU resources.
○ Run nodetool repair
● More info: Restoring from a Snapshot...
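The file-handling part of the restore procedure (delete the *.db files, then copy the snapshot contents back) could be scripted roughly as follows. This is a hedged Python sketch: `restore_snapshot` is a hypothetical helper, it assumes the node is already shut down, and it does not clear the commit log, start the node, or run nodetool repair.

```python
import shutil
from pathlib import Path


def restore_snapshot(table_dir, snapshot_name):
    """Replace a table's *.db files with the contents of one snapshot.

    table_dir is the table's data directory
    (data_directory_location/keyspace_name/table_name-UUID).
    Returns the names of the restored files.
    """
    table_dir = Path(table_dir)
    snapshot_dir = table_dir / "snapshots" / snapshot_name
    if not snapshot_dir.is_dir():
        raise FileNotFoundError(f"snapshot not found: {snapshot_dir}")

    # Delete the current data files (non-recursive, so snapshots are kept).
    for db_file in list(table_dir.glob("*.db")):
        db_file.unlink()

    # Copy the snapshot's files into the table directory.
    restored = []
    for src in sorted(snapshot_dir.iterdir()):
        if src.is_file():
            shutil.copy2(src, table_dir / src.name)
            restored.append(src.name)
    return restored
```

Copying rather than moving leaves the snapshot intact, so the restore can be repeated if something goes wrong.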
DataStax OpsCenter
● Visually create new clusters with a few mouse clicks either on premise or in the cloud
● Add, edit, and remove nodes
● Automatically rebalance a cluster
● Control automatic management services including transparent repair
● Manage and schedule backup and restore operations
● Perform capacity planning with historical trend analysis and forecasting capabilities
● Proactively manage all clusters with threshold and timing-based alerts
● Generate reports and diagnostic reports with the push of a button
● Integrate with other enterprise tools via developer API
● More info: http://www.datastax.com/datastax-opscenter
Who’s Using Cassandra?
● Apple
● CERN
● Cisco
● Digg
● Facebook
● IBM
● Instagram
● Mahalo.com
● Netflix
● Rackspace
● Reddit
● SoundCloud
● Spotify
● Twitter
● Zoho
● http://planetcassandra.org/companies/
Where Can I Learn More?
● https://cassandra.apache.org/
● http://planetcassandra.org/
● http://www.datastax.com
Thank you
Any questions or comments?

More Related Content

What's hot

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah
 
Firewall Design and Implementation
Firewall Design and ImplementationFirewall Design and Implementation
Firewall Design and Implementation
ajeet singh
 
RPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy NetworksRPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy Networks
Pradeep Kumar TS
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
Rahul Jain
 
UDP - User Datagram Protocol
UDP - User Datagram ProtocolUDP - User Datagram Protocol
UDP - User Datagram Protocol
Peter R. Egli
 
Gateway Networking
Gateway NetworkingGateway Networking
Gateway Networking
Abhishek Kumar Ravi
 
Series-and-Parallel-Algorithm.pptx
Series-and-Parallel-Algorithm.pptxSeries-and-Parallel-Algorithm.pptx
Series-and-Parallel-Algorithm.pptx
BikashKhanal15
 
Chapter 17 - Distributed File Systems
Chapter 17 - Distributed File SystemsChapter 17 - Distributed File Systems
Chapter 17 - Distributed File Systems
Wayne Jones Jnr
 
3.Medium Access Control
3.Medium Access Control3.Medium Access Control
3.Medium Access ControlSonali Chauhan
 
Osi reference model
Osi reference modelOsi reference model
Osi reference model
Sagar Gor
 
Introduction to redis
Introduction to redisIntroduction to redis
Introduction to redis
Tanu Siwag
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Jignesh Shah
 
Caching
CachingCaching
Caching
Nascenia IT
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
alexbaranau
 
Using all of the high availability options in MariaDB
Using all of the high availability options in MariaDBUsing all of the high availability options in MariaDB
Using all of the high availability options in MariaDB
MariaDB plc
 
QoS (quality of service)
QoS (quality of service)QoS (quality of service)
QoS (quality of service)
Sri Safrina
 
Internetworking devices
Internetworking devicesInternetworking devices
Internetworking devices
Online
 
Google Bigtable Paper Presentation
Google Bigtable Paper PresentationGoogle Bigtable Paper Presentation
Google Bigtable Paper Presentation
vanjakom
 
IEEE standards 802.3.&802.11
IEEE standards 802.3.&802.11IEEE standards 802.3.&802.11
IEEE standards 802.3.&802.11
Keshav Maheshwari
 

What's hot (20)

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Firewall Design and Implementation
Firewall Design and ImplementationFirewall Design and Implementation
Firewall Design and Implementation
 
AMQP
AMQPAMQP
AMQP
 
RPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy NetworksRPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy Networks
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
 
UDP - User Datagram Protocol
UDP - User Datagram ProtocolUDP - User Datagram Protocol
UDP - User Datagram Protocol
 
Gateway Networking
Gateway NetworkingGateway Networking
Gateway Networking
 
Series-and-Parallel-Algorithm.pptx
Series-and-Parallel-Algorithm.pptxSeries-and-Parallel-Algorithm.pptx
Series-and-Parallel-Algorithm.pptx
 
Chapter 17 - Distributed File Systems
Chapter 17 - Distributed File SystemsChapter 17 - Distributed File Systems
Chapter 17 - Distributed File Systems
 
3.Medium Access Control
3.Medium Access Control3.Medium Access Control
3.Medium Access Control
 
Osi reference model
Osi reference modelOsi reference model
Osi reference model
 
Introduction to redis
Introduction to redisIntroduction to redis
Introduction to redis
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
 
Caching
CachingCaching
Caching
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Using all of the high availability options in MariaDB
Using all of the high availability options in MariaDBUsing all of the high availability options in MariaDB
Using all of the high availability options in MariaDB
 
QoS (quality of service)
QoS (quality of service)QoS (quality of service)
QoS (quality of service)
 
Internetworking devices
Internetworking devicesInternetworking devices
Internetworking devices
 
Google Bigtable Paper Presentation
Google Bigtable Paper PresentationGoogle Bigtable Paper Presentation
Google Bigtable Paper Presentation
 
IEEE standards 802.3.&802.11
IEEE standards 802.3.&802.11IEEE standards 802.3.&802.11
IEEE standards 802.3.&802.11
 

Similar to An Introduction to Apache Cassandra

Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
JWORKS powered by Ordina
 
Cassandra overview
Cassandra overviewCassandra overview
Cassandra overviewSean Murphy
 
Redis as a Main Database, Scaling and HA
Redis as a Main Database, Scaling and HARedis as a Main Database, Scaling and HA
Redis as a Main Database, Scaling and HA
Dave Nielsen
 
Running Cassandra in AWS
Running Cassandra in AWSRunning Cassandra in AWS
Running Cassandra in AWS
DataStax Academy
 
Cassandra
CassandraCassandra
Cassandra
Upaang Saxena
 
Redshift
RedshiftRedshift
Redshift
Paulo Kieffer
 
Introduction to Apache Cassandra
Introduction to Apache Cassandra Introduction to Apache Cassandra
Introduction to Apache Cassandra
Knoldus Inc.
 
Distributed Databases - Concepts & Architectures
Distributed Databases - Concepts & ArchitecturesDistributed Databases - Concepts & Architectures
Distributed Databases - Concepts & Architectures
Daniel Marcous
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
Artur Mkrtchyan
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
András Fehér
 
Hadoop and cassandra
Hadoop and cassandraHadoop and cassandra
Hadoop and cassandraChristina Yu
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
ScyllaDB
 
Linux Stammtisch Munich: Ceph - Overview, Experiences and Outlook
Linux Stammtisch Munich: Ceph - Overview, Experiences and OutlookLinux Stammtisch Munich: Ceph - Overview, Experiences and Outlook
Linux Stammtisch Munich: Ceph - Overview, Experiences and Outlook
Danny Al-Gaaf
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
Omid Vahdaty
 
Distributed unique id generation
Distributed unique id generationDistributed unique id generation
Distributed unique id generation
Tung Nguyen
 
NewSQL - The Future of Databases?
NewSQL - The Future of Databases?NewSQL - The Future of Databases?
NewSQL - The Future of Databases?
Elvis Saravia
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017
Severalnines
 
Scaling Cassandra for Big Data
Scaling Cassandra for Big DataScaling Cassandra for Big Data
Scaling Cassandra for Big DataDataStax Academy
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
Mukesh Singh
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan Lambright
Gluster.org
 

Similar to An Introduction to Apache Cassandra (20)

Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
 
Cassandra overview
Cassandra overviewCassandra overview
Cassandra overview
 
Redis as a Main Database, Scaling and HA
Redis as a Main Database, Scaling and HARedis as a Main Database, Scaling and HA
Redis as a Main Database, Scaling and HA
 
Running Cassandra in AWS
Running Cassandra in AWSRunning Cassandra in AWS
Running Cassandra in AWS
 
Cassandra
CassandraCassandra
Cassandra
 
Redshift
RedshiftRedshift
Redshift
 
Introduction to Apache Cassandra
Introduction to Apache Cassandra Introduction to Apache Cassandra
Introduction to Apache Cassandra
 
Distributed Databases - Concepts & Architectures
Distributed Databases - Concepts & ArchitecturesDistributed Databases - Concepts & Architectures
Distributed Databases - Concepts & Architectures
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Hadoop and cassandra
Hadoop and cassandraHadoop and cassandra
Hadoop and cassandra
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
 
Linux Stammtisch Munich: Ceph - Overview, Experiences and Outlook
Linux Stammtisch Munich: Ceph - Overview, Experiences and OutlookLinux Stammtisch Munich: Ceph - Overview, Experiences and Outlook
Linux Stammtisch Munich: Ceph - Overview, Experiences and Outlook
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Distributed unique id generation
Distributed unique id generationDistributed unique id generation
Distributed unique id generation
 
NewSQL - The Future of Databases?
NewSQL - The Future of Databases?NewSQL - The Future of Databases?
NewSQL - The Future of Databases?
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017
 
Scaling Cassandra for Big Data
Scaling Cassandra for Big DataScaling Cassandra for Big Data
Scaling Cassandra for Big Data
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan Lambright
 

More from Saeid Zebardast

Web Components Revolution
Web Components RevolutionWeb Components Revolution
Web Components Revolution
Saeid Zebardast
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Saeid Zebardast
 
An overview of Scalable Web Application Front-end
An overview of Scalable Web Application Front-endAn overview of Scalable Web Application Front-end
An overview of Scalable Web Application Front-end
Saeid Zebardast
 
MySQL Cheat Sheet
MySQL Cheat SheetMySQL Cheat Sheet
MySQL Cheat Sheet
Saeid Zebardast
 
Java Cheat Sheet
Java Cheat SheetJava Cheat Sheet
Java Cheat Sheet
Saeid Zebardast
 
Developing Applications with MySQL and Java for beginners
Developing Applications with MySQL and Java for beginnersDeveloping Applications with MySQL and Java for beginners
Developing Applications with MySQL and Java for beginners
Saeid Zebardast
 
Java for beginners
Java for beginnersJava for beginners
Java for beginners
Saeid Zebardast
 
MySQL for beginners
MySQL for beginnersMySQL for beginners
MySQL for beginners
Saeid Zebardast
 
هفده اصل افراد موثر در تیم
هفده اصل افراد موثر در تیمهفده اصل افراد موثر در تیم
هفده اصل افراد موثر در تیم
Saeid Zebardast
 
What is good design?
What is good design?What is good design?
What is good design?
Saeid Zebardast
 
How to be different?
How to be different?How to be different?
How to be different?
Saeid Zebardast
 
What is REST?
What is REST?What is REST?
What is REST?
Saeid Zebardast
 
معرفی گنو/لینوکس و سیستم عامل های متن باز و آزاد
معرفی گنو/لینوکس و سیستم عامل های متن باز و آزادمعرفی گنو/لینوکس و سیستم عامل های متن باز و آزاد
معرفی گنو/لینوکس و سیستم عامل های متن باز و آزاد
Saeid Zebardast
 

More from Saeid Zebardast (13)

Web Components Revolution
Web Components RevolutionWeb Components Revolution
Web Components Revolution
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
An overview of Scalable Web Application Front-end
An overview of Scalable Web Application Front-endAn overview of Scalable Web Application Front-end
An overview of Scalable Web Application Front-end
 
MySQL Cheat Sheet
MySQL Cheat SheetMySQL Cheat Sheet
MySQL Cheat Sheet
 
Java Cheat Sheet
Java Cheat SheetJava Cheat Sheet
Java Cheat Sheet
 
Developing Applications with MySQL and Java for beginners
Developing Applications with MySQL and Java for beginnersDeveloping Applications with MySQL and Java for beginners
Developing Applications with MySQL and Java for beginners
 
Java for beginners
Java for beginnersJava for beginners
Java for beginners
 
MySQL for beginners
MySQL for beginnersMySQL for beginners
MySQL for beginners
 
هفده اصل افراد موثر در تیم
هفده اصل افراد موثر در تیمهفده اصل افراد موثر در تیم
هفده اصل افراد موثر در تیم
 
What is good design?
What is good design?What is good design?
What is good design?
 
How to be different?
How to be different?How to be different?
How to be different?
 
What is REST?
What is REST?What is REST?
What is REST?
 
معرفی گنو/لینوکس و سیستم عامل های متن باز و آزاد
معرفی گنو/لینوکس و سیستم عامل های متن باز و آزادمعرفی گنو/لینوکس و سیستم عامل های متن باز و آزاد
معرفی گنو/لینوکس و سیستم عامل های متن باز و آزاد
 

Recently uploaded

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 


An Introduction to Apache Cassandra

  • 1. A Comprehensive Introduction to Apache Cassandra Saeid Zebardast @saeidzeb zebardast.com Feb 2015
  • 2. Agenda ● What is NoSQL? ● What is Cassandra? ● Architecture ● Data Model ● Key Features and Benefits ● Hardware ● Directories and Files ● Cassandra Tools ○ CQL ○ Nodetool ○ DataStax Opscenter ● Backup and Restore ● Who’s using Cassandra? 2
  • 3. What is NoSQL? ● NoSQL (Not Only SQL) ● Simplicity of design ● Horizontal scaling (scale out) ○ Add as many nodes to the cluster as you wish ○ Not supported by all NoSQL databases ● Finer control over availability ● Data structure ○ Key-Value ○ Column-Oriented ○ Graph ○ Document-Oriented ○ etc. 3
  • 4. What is Cassandra? ● Since 2008 - Current stable version 2.1.2 (Nov 2014) ● NoSQL ● Distributed ● Open source ● Written in Java ● High performance ● Extremely scalable ● Fault tolerant (i.e., no single point of failure, SPOF) 4
  • 5. Architecture Highlights ● Scale out, not up ● Peer-to-Peer, distributed system ○ All nodes are the same - masterless with no SPOF ● Online load balancing, cluster growth ● Understanding System/Hardware failures ● Custom data replication to ensure fault tolerance ● CAP theorem (Consistency, Availability, Partition tolerance) ○ You cannot have all three at the same time ○ The tradeoff between consistency and latency is tunable ○ Strong Consistency = Increased Latency ● Nodes communicate with each other ○ through the Gossip protocol 5
  • 6. Architecture Layers ● Core Layer: Messaging service, Gossip Failure detection, Cluster state, Partitioner, Replication ● Middle Layer: Commit log, Memtable, SSTable, Indexes, Compaction ● Top Layer: Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools 6
  • 7. Architecture of a write 1. The write is first appended to an on-disk commit log (sequential). 2. After the commit log write, it is sent to the appropriate nodes. 3. Each node receiving the write first records it in a local log, then updates the appropriate Memtables (one for each column family). ○ A Memtable is an in-memory representation of data (before the data gets flushed to disk as an SSTable). ○ Memtables are flushed to disk when: ■ Out of space ■ Too many keys (128 is default) ■ Time duration (Client provided - no cluster clock) 4. When Memtables are written out, two files are produced: ○ Data File (SSTable) ○ Index File (SSTable Index) 5. When a commit log has had all its column families pushed to disk, it is deleted. 6. Compaction ○ Periodically, data files are merge-sorted into a new file. ○ Merge keys ○ Combine columns ○ Discard tombstones 7
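The write path above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Cassandra's implementation: the `Node` class, the `memtable_limit` threshold, and the use of `None` as a tombstone are all hypothetical names chosen for illustration, and "disk" is just a Python list.

```python
class Node:
    """Toy single-node write path: commit log -> memtable -> SSTable -> compaction."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # sequential, append-only log of pending writes
        self.memtable = {}        # in-memory key -> {column: value}
        self.sstables = []        # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (key count)

    def write(self, key, columns):
        self.commit_log.append((key, columns))             # record in the log first
        self.memtable.setdefault(key, {}).update(columns)  # then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as an immutable, key-sorted SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # everything logged so far has been flushed

    def compact(self):
        # Merge all SSTables; newer column values overwrite older ones.
        merged = {}
        for table in self.sstables:
            for key, columns in table.items():
                merged.setdefault(key, {}).update(columns)
        # Discard tombstones (modelled here as None values).
        live = {}
        for key, columns in sorted(merged.items()):
            kept = {c: v for c, v in columns.items() if v is not None}
            if kept:
                live[key] = kept
        self.sstables = [live]
```

Writing four keys with `memtable_limit=2` produces two SSTables; calling `compact()` then merges them into one and drops any column whose latest value is a tombstone.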
  • 8. Data Model ● [Keyspace][ColumnFamily][Key][Column] ● A keyspace is akin to a database in an RDBMS ● A keyspace holds a row-oriented column structure ● A column family is similar to an RDBMS table ○ More flexible/dynamic ● A row in a column family is indexed by its key (Primary Key). ○ Cassandra supports up to 2 billion columns per (physical) row. ● Sample code to create a keyspace and a column family: ○ CREATE KEYSPACE logs WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; ○ CREATE TABLE logs.samples ( node_id text, metric text, collection_ts timestamp, value bigint, PRIMARY KEY ((node_id, metric), collection_ts) ) WITH CLUSTERING ORDER BY (collection_ts DESC); 8
  • 9. Data Model - Primary Keys ● Primary Keys are unique. ● Single Primary Key ○ PRIMARY KEY (keyColumn) ● Composite Primary Key ○ PRIMARY KEY (myPartitionKey, my1stClusteringKey, my2ndClusteringKey) ● Composite Partition Key ○ PRIMARY KEY ((my1stPartitionKey, my2ndPartitionKey), myClusteringKey) 9
  • 10. Data Model - Time-To-Live (TTL) ● TTL on a row ○ INSERT INTO users (id, first, last) VALUES ('abc123', 'saeid', 'zeb') USING TTL 3600; // Expires data in one hour ● TTL on a column ○ UPDATE users USING TTL 30 SET last = 'zebardast' WHERE id = 'abc123'; ● TTL is in seconds. ● A default TTL can also be set at the table level. ● Expired columns/rows are deleted automatically. ● With no TTL specified, columns/values never expire. ● TTL is useful for automatic deletion. ● Re-inserting the same row before it expires overwrites the TTL. 10
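The TTL rules on this slide can be illustrated with a small in-memory model. This is a sketch only: `TTLStore`, its methods, and the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow) are hypothetical and are not part of Cassandra or any driver.

```python
class TTLStore:
    """Toy key-value store with per-write TTL, using an injected clock."""

    def __init__(self, default_ttl=None):
        self.default_ttl = default_ttl  # table-level default TTL, in seconds
        self.rows = {}                  # key -> (value, expires_at or None)

    def insert(self, key, value, now, ttl=None):
        ttl = self.default_ttl if ttl is None else ttl
        expires_at = None if ttl is None else now + ttl
        self.rows[key] = (value, expires_at)   # re-inserting overwrites the TTL

    def get(self, key, now):
        value, expires_at = self.rows.get(key, (None, None))
        if expires_at is not None and now >= expires_at:
            return None                 # expired data is no longer visible
        return value
```

For example, a value inserted at `now=0` with `ttl=3600` is readable at `now=3599` and gone at `now=3600`; inserting it again at `now=1000` restarts the countdown from that point.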
  • 11. Partitioners - Consistent hashing ● A partitioner determines how data is distributed across the nodes in the cluster (including replicas). ● A partitioner is a function for deriving a token representing a row from its partition key (typically by hashing).
    Sample data:
      name      email                 gender
      Saeid     saeid@domain.com      M
      Kamyar    kamyar@domain.com     M
      Nazanin   nazanin@domain.com    F
      Masoud    masoud@domain.com     M
    Cassandra assigns a hash value to each partition key:
      partition key   Murmur3 hash value
      Saeid           -2245462676723223822
      Kamyar          7723358927203680754
      Nazanin         -6723372854036780875
      Masoud          1168604627387940318
    Cassandra places the data on each node according to the value of the partition key and the range that the node is responsible for:
      Node   Start range             End range               Partition key   Hash value
      A      -9223372036854775808    -4611686018427387904    Nazanin         -6723372854036780875
      B      -4611686018427387903    -1                      Saeid           -2245462676723223822
      C      0                       4611686018427387903     Masoud          1168604627387940318
      D      4611686018427387904     9223372036854775807     Kamyar          7723358927203680754
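As a rough sketch of the lookup this slide describes, the snippet below maps a token to the node whose range contains it. It does not compute Murmur3 hashes; it takes the token as given. The names `RANGE_ENDS`, `NODES`, and `node_for_token` are hypothetical, and the ranges are assumed to be contiguous and inclusive of their upper bound.

```python
from bisect import bisect_left

# Inclusive upper bound of each node's token range, in ring order.
# Assumption: four nodes splitting the signed 64-bit token space into contiguous ranges.
RANGE_ENDS = [-4611686018427387904, -1, 4611686018427387903, 9223372036854775807]
NODES = ["A", "B", "C", "D"]


def node_for_token(token):
    """Return the node whose token range contains the given token."""
    # bisect_left finds the first range whose upper bound is >= token.
    return NODES[bisect_left(RANGE_ENDS, token)]
```

Calling `node_for_token` with each of the four hash values listed on the slide places one row on each of the four nodes.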
  • 12. Key Features and Benefits ● Gigabyte to Petabyte scalability ● Linear performance ● No SPOF ● Easy replication / data distribution ● Multi datacenter and cloud capable ● No need for separate caching layer ● Tunable data consistency ● Flexible schema design ● Data compression ● CQL Language (like SQL) ● Support for key languages and platforms ● No need for special hardware or software 12
  • 13. Big Data Scalability ● Capable of comfortably scaling to petabytes ● New nodes = linear performance increase ● Add new nodes online 13
  • 14. No Single Point of Failure ● All nodes are the same ○ Peer-to-Peer - masterless ● Customized replication affords tunable data redundancy ● Read/Write from any node ● Can replicate data among different physical data center racks 14
  • 15. Easy Replication / Data Distribution ● Transparently handled by Cassandra ● Multi-data center capable ● Exploits all the benefits of Cloud computing ● Able to do Hybrid Cloud/On-Premise setup 15
  • 16. No Need for Caching Software ● Peer-to-Peer architecture ○ removes need for special caching layer ● The database cluster uses the memory from all participating nodes to cache the data assigned to each node. ● No irregularities between a memory cache and database are encountered 16
  • 17. Tunable Data Consistency ● Choose between strong and eventual consistency ○ Depends on the need ● Can be done on a per-operation basis, and for both reads and writes. ● Handles multi-data center operations ● Consistency Level (CL) ○ ALL = all replicas ack ○ QUORUM = a majority (more than 50%) of replicas ack ○ ONE = only one replica acks ○ Plus more… (see docs) 17
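The three consistency levels listed here translate into simple arithmetic on the replication factor. The sketch below covers only ALL, QUORUM, and ONE; the function names are hypothetical. The overlap rule it uses (read acks + write acks > replication factor gives strong consistency) is not stated on the slide but is the usual way the strong/eventual choice is reasoned about.

```python
def required_acks(level, replication_factor):
    """Number of replica acknowledgements needed for a consistency level."""
    if level == "ALL":
        return replication_factor
    if level == "QUORUM":
        return replication_factor // 2 + 1   # strict majority of replicas
    if level == "ONE":
        return 1
    raise ValueError(f"unsupported consistency level: {level}")


def is_strongly_consistent(read_level, write_level, replication_factor):
    """True when every read replica set must overlap every write replica set."""
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM needs 2 acks, so QUORUM reads combined with QUORUM writes overlap in at least one replica, whereas ONE/ONE does not guarantee that.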
  • 18. Flexible Schema ● Dynamic schema design ● Handles structured, semi-structured, and unstructured data. ● Counters are supported ● No offline/downtime for schema changes ● Supports primary and secondary indexes ○ Secondary indexes != Relational Indexes (they are for convenience, not speed) 18
  • 19. Data Compression ● Uses Google’s Snappy data compression algorithm ● Compresses data on a per column family level ● Internal tests at DataStax show up to 80%+ compression on row data ● No performance penalty ○ Some increases in overall performance due to less physical I/O 19
  • 20. Locally Distributed ● Client reads or writes to any node ● Node coordinates with others ● Data read or replicated in parallel ● Replication info ○ Replication Factor (RF): how many copies of your data are stored ○ Each node stores RF / (cluster size) of the cluster's total data, e.g. RF = 3 on 6 nodes means 50% per node. ○ Handy Calculator: http://www.ecyrd.com/cassandracalculator/ 20
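The per-node share of data mentioned above is a one-line calculation. The helper below is a hypothetical illustration that assumes an evenly balanced ring; the cap at 1.0 reflects that a node cannot hold more than one full copy of the data.

```python
def ownership_fraction(replication_factor, cluster_size):
    """Fraction of the cluster's total data stored on each node (balanced ring assumed)."""
    if cluster_size < 1 or replication_factor < 1:
        raise ValueError("replication factor and cluster size must be positive")
    # A node holds at most one full copy, so the fraction is capped at 1.0.
    return min(1.0, replication_factor / cluster_size)
```

For instance, `ownership_fraction(3, 6)` gives 0.5, so doubling the cluster to 12 nodes at the same replication factor halves the share each node has to store.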
  • 21. Rack Aware ● Cassandra is aware of which rack (or availability zone) each node resides in. ● It will attempt to place each data copy in a different rack. 21
  • 22. Data Center Aware ● Active Everywhere - reads/writes in multiple data centers ● Client writes local ● Data syncs across WAN ● Replication Factor per DC ● Different number of nodes per data center 22
  • 23. Node Failure ● A single node failure shouldn’t bring down the cluster. ● Replication Factor + Consistency Level = Success 23
  • 24. Node Recovery ● When a write is performed and a replica node for the row is unavailable the coordinator will store a hint locally. ● When the node recovers, the coordinator replays the missed writes. ● Note: a hinted write does not count towards the consistency level. ● Note: you should still run repairs across your cluster. 24
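Hinted handoff as described on this slide can be modelled with a toy coordinator. Everything here is a simplified assumption for illustration: the `Coordinator` class, its `down` set, and the in-memory `hints` list are hypothetical, and real hints are persisted and expire after a configurable window.

```python
class Coordinator:
    """Toy coordinator that stores hints for replicas that are down."""

    def __init__(self, replicas):
        self.replicas = {name: {} for name in replicas}  # replica name -> its data
        self.down = set()                                # replicas currently unavailable
        self.hints = []                                  # (replica, key, value) to replay

    def write(self, key, value):
        acks = 0
        for name, data in self.replicas.items():
            if name in self.down:
                self.hints.append((name, key, value))    # store a hint locally
            else:
                data[key] = value
                acks += 1                                # hinted writes are not counted
        return acks

    def recover(self, name):
        self.down.discard(name)
        remaining = []
        for replica, key, value in self.hints:
            if replica == name:
                self.replicas[replica][key] = value      # replay the missed write
            else:
                remaining.append((replica, key, value))
        self.hints = remaining
```

With three replicas and one marked down, `write` returns 2 acknowledgements and keeps one hint; `recover` then replays that hint so the returning replica catches up. A repair is still needed for writes whose hints were lost or never stored.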
  • 25. Security in Cassandra ● Internal Authentication ○ Manages login IDs and passwords inside the database. ● Object Permission Management ○ Controls who has access to what and who can do what in the database ○ Uses familiar GRANT/REVOKE from relational systems. ● Client to Node Encryption ○ Protects data in flight to and from a database 25
  • 26. Hardware ● RAM ○ The more memory a Cassandra node has, the better its read performance. ■ For dedicated hardware, the optimal price-performance sweet spot is 16GB to 64GB; the minimum is 8GB. ■ For virtual environments, the optimal range may be 8GB to 16GB; the minimum is 4GB. ● CPU ○ More cores are better. Cassandra is built with concurrency in mind. ■ For dedicated hardware, 8-core CPU processors are the current price-performance sweet spot. ■ For virtual environments, consider using a provider that allows CPU bursting, such as Rackspace. ● Disk ○ Cassandra tries to minimize random IO. Use a minimum of 2 disks. Keep the CommitLog and Data (SSTable) on separate spindles. RAID10 or RAID0 as you see fit. ○ XFS or ext4. ● Network ○ Be sure that your network can handle traffic between nodes without bottlenecks. ■ Recommended bandwidth is 1000 Mbit/s (gigabit) or greater. ● More info: Selecting hardware for enterprise implementations... 26
  • 27. Directories and Files ● Configs ○ The main configuration file for Cassandra ■ /etc/cassandra/cassandra.yaml ○ Java Virtual Machine (JVM) configuration settings ■ /etc/cassandra/cassandra-env.sh ● Data directories ○ /var/lib/cassandra ● Log directory ○ /var/log/cassandra ● Environment settings ○ /usr/share/cassandra ● Cassandra user limits ○ /etc/security/limits.d/cassandra.conf ● More info: Package installation directories... 27
  • 28. CQL Language ● Very similar to RDBMS SQL syntax ● Create objects via DDL (e.g. CREATE) ● Core DML commands supported: INSERT, UPDATE, DELETE ● Query data with SELECT ● cqlsh, the Python-based command-line client ○ CASSANDRA_PATH/bin/cqlsh ● More info: https://cassandra.apache.org/doc/cql/CQL.html 28
  • 29. Nodetool ● A command line interface for managing a cluster. ○ CASSANDRA_PATH/bin/nodetool ● Useful commands: ○ nodetool info - Display node info (uptime, load, etc.). ○ nodetool status [keyspace] - Display cluster info (state, load, etc.). ○ nodetool cfstats [keyspace] - Display statistics of column families. ○ nodetool tpstats - Display usage statistics of thread pools. ○ nodetool netstats - Display network information. ○ nodetool repair - Repair one or more column families. ○ nodetool rebuild - Rebuild data by streaming from other nodes (similar to bootstrap). ○ nodetool drain - Flush Memtables to SSTables on disk and stop accepting writes. Useful before a restart to make startup quick. ○ nodetool flush [keyspace [columnfamily]] - Flush one or more column families from the memtable. ○ nodetool cfhistograms keyspace columnfamily - Display statistic histograms for a given column family. ○ nodetool proxyhistograms - Display statistic histograms for network operations. ○ nodetool help - Display help information. 29
  • 30. Backup and Restore ● Take Snapshot ○ nodetool snapshot ■ /var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/snapshot_name ○ nodetool clearsnapshot ● Restore Procedure ○ Shut down the node. ○ Clear all files in the commitlog directory (/var/lib/cassandra/commitlog). ○ Delete all *.db files in the data_directory_location/keyspace_name/table_name-UUID directory. ○ Locate the most recent snapshot folder in this directory: ■ data_directory_location/keyspace_name/table_name-UUID/snapshots/snapshot_name ○ Copy its contents into this directory: ■ data_directory_location/keyspace_name/table_name-UUID ○ Start the node. ■ Restarting causes a temporary burst of I/O activity and consumes a large amount of CPU resources. ○ Run nodetool repair. ● More info: Restoring from a Snapshot... 30
  • 31. DataStax Opscenter ● Visually create new clusters with a few mouse clicks either on premise or in the cloud ● Add, edit, and remove nodes ● Automatically rebalance a cluster ● Control automatic management services including transparent repair ● Manage and schedule backup and restore operations ● Perform capacity planning with historical trend analysis and forecasting capabilities ● Proactively manage all clusters with threshold and timing-based alerts ● Generate reports and diagnostic reports with the push of a button ● Integrate with other enterprise tools via developer API ● More info: http://www.datastax.com/datastax-opscenter 31
  • 32. Who’s Using Cassandra? ● Apple ● CERN ● Cisco ● Digg ● Facebook ● IBM ● Instagram ● Mahalo.com ● Netflix ● Rackspace ● Reddit ● SoundCloud ● Spotify ● Twitter ● Zoho ● http://planetcassandra.org/companies/ 32
  • 33. Where Can I Learn More? ● https://cassandra.apache.org/ ● http://planetcassandra.org/ ● http://www.datastax.com 33
  • 34. Thank you Saeid Zebardast @saeidzeb zebardast.com Feb 2015 Any Questions, Comments? 34