This document provides an overview of Apache Cassandra including its history, architecture, data modeling concepts, and how to install and use it with Python. Key points include that Cassandra is a distributed, scalable NoSQL database designed without single points of failure. It discusses Cassandra's architecture including nodes, datacenters, clusters, commit logs, memtables, and SSTables. Data modeling concepts explained are keyspaces, column families, and designing for even data distribution and minimizing reads. The document also provides examples of creating a keyspace, reading data using Python driver, and demoing data clustering.
Professional Resume Template for Software Developers
Getting Started with Apache Cassandra and Python - A Comprehensive Guide
1. Getting started with Apache
Cassandra and Python
By
Adnan Siddiqi
(http://adnansiddiqi.me)
2. What is Apache Cassandra?
• According to Wikipedia:
Apache Cassandra is a free and open-source,
distributed, wide column store, NoSQL database
management system designed to handle large
amounts of data across many commodity servers,
providing high availability with no single point of
failure. Cassandra offers robust support for
clusters spanning multiple datacenters,[1] with
asynchronous masterless replication allowing low
latency operations for all clients.
3. History
• Developed by two Facebook engineers to deal
with search mechanism of Inbox.
• Released as an open-source project after few
years.
• Handed over to Apache Foundation.
6. Architecture(Contd…)
• Node:- The basic component of the data, a
machine where the data is stored.
• Datacenter:- A collection of related nodes. It
can be a physical datacenter or virtual.
• Cluster:- A cluster contains one or more
datacenters, it could span across locations.
• Commit Log:- Every write operation is first
stored in the commit log. It is used for crash
recovery.
7. Architecture(Contd…)
• Mem-Table:- After data is written to the
commit log it then is stored in Mem-
Table(Memory Table) which remains there till
it reaches to the threshold.
• SSTable:- Sorted-String Table or SSTable is a
disk file which stores data from MemTable
once it reaches to the threshold. SSTables are
stored on disk sequentially and maintained for
each database table.
9. Write Operations(Contd…)
• Write request is stored in both CommitLog to
make sure that data is saved.
• Data is written in Memtable which holds data
till it reaches to threshold.
• Data is flused to SSTable once Memtable
reaches to its threshold.
• The node that accepts requests called
Coordinator.
10. Read Operations
• Direct Request:- The coordinator node sends
the read request to one of the replicas.
• Digest:- The coordinator contacts the replicas
specified by the consistency level. The
contacted nodes respond with a digest
request of the required data. Comparison
takes place to make sure that the update data
is sent back.
12. Simple Strategy
• It is used when you have only one data center.
It places the first replica on the node selected
by the partitioner. A partitioner determines
how data is distributed across the nodes in the
cluster (including replicas). After that,
remaining replicas are placed in a clockwise
direction in the Node ring.
14. Network Topology Strategy
• Deployments across multiple Datacenters.
• This strategy places replicas in the same
datacenter by traversing the ring clockwise
until reaching the first node in another rack.
• This strategy is highly recommended for
scalability purpose and future expansion.
16. Installation and Setup
• Dockerized Version.
• docker pull cassandra
• Make sure to set the Docker memory to 4GB
atleast to avoid 137 exit error code.
22. Cassandra Data Modeling
• Keyspace:- It is the container collection of
column families. You can think of it as a
Database in the RDBMS world.
• Column Family:- A column family is a
container for an ordered collection of rows.
Each row, in turn, is an ordered collection of
columns. Think of it as a Table in the RDBMS
world.