3. History of Cassandra
• Apache Cassandra was born at Facebook for inbox
search. Facebook open sourced the code in 2008.
• Cassandra became an Apache Incubator project
in 2009 and subsequently became a top-level
Apache project in 2010.
• The latest version of Apache Cassandra at the time of this deck was 3.1.1.
• It is a column-oriented database designed around peer-to-peer symmetric nodes instead of a master-slave architecture.
• Its design builds on Amazon's Dynamo and Google's Bigtable:
cassandra ~= bigtable + dynamo
5. What is Cassandra?
• Apache Cassandra is a highly scalable, high-performance
distributed database designed to handle large amounts of
structured data across many commodity servers with
replication, providing high availability and no single point
of failure.
6. • The circles are Cassandra nodes and the lines between the circles show the distributed architecture, while the client sends data to a node. (Ring Architecture)
7. Notable points
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and
its data model on Google’s Bigtable.
• Cassandra implements a Dynamo-style replication model
with no single point of failure, but adds a more powerful
“column family” data model.
• Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Adobe, Netflix, and more.
8. Features of Cassandra
• Elastic scalability - Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as required.
• Massively Scalable Architecture: Cassandra has a
masterless design where all nodes are at the same level
which provides operational simplicity and easy scale out.
• Always on architecture (peer-to-peer
network): Cassandra replicates data on different nodes
that ensures no single point of failure and it is
continuously available for business-critical applications.
• Linear Scale Performance: As more nodes are added,
the performance of Cassandra increases. Therefore it
maintains a quick response time.
9. Features of Cassandra
• Flexible data storage - Cassandra accommodates all possible data formats, including structured, semi-structured, and unstructured data. It can dynamically accommodate changes to data structures as needed.
• Easy data distribution - Cassandra provides the flexibility to
distribute data where you need by replicating data across
multiple data centers.
• Transaction support - Cassandra provides atomicity, isolation, and durability for writes, though (unlike an RDBMS) it does not offer full multi-row ACID transactions.
• Fast writes - Cassandra was designed to run on cheap
commodity hardware. It performs blazingly fast writes and
can store hundreds of terabytes of data, without sacrificing
the read efficiency.
10. Features of Cassandra
• Fault Detection and Recovery: failed nodes can easily be detected, restored, and recovered.
• Flexible and Dynamic Data Model: supports a rich set of datatypes with fast writes and reads.
• Data Protection: data is protected by the commit-log design and built-in security features such as backup and restore mechanisms.
• Tunable Data Consistency: consistency can be tuned per operation, from eventual to strong, across the distributed architecture.
• Multi-Data-Center Replication: Cassandra provides a feature to replicate data across multiple data centers.
11. Features of Cassandra
• Data Compression: Cassandra can compress data by up to 80% without significant overhead.
• Cassandra Query Language (CQL): Cassandra provides a query language similar to SQL, which makes it easy for developers to move from a relational database to Cassandra.
12. Cassandra Use Cases/Application
• Messaging: Cassandra is a great database for companies that provide mobile phone and messaging services. These companies handle huge amounts of data, so Cassandra suits them well.
• Internet of Things applications: Cassandra is a great database for applications where data arrives at very high velocity from many devices or sensors.
• Product Catalogs and retail apps: Cassandra is used by
many retailers for durable shopping cart protection and
fast product catalog input and output.
13. Cassandra Use Cases/Application
• Social Media Analytics and recommendation engine:
Cassandra is a great database for many online companies
and social media providers for analysis and
recommendation to their customers.
14. Cassandra Architecture
• The design goal of Cassandra is to handle big data
workloads across multiple nodes without any single
point of failure.
• Cassandra is a peer-to-peer distributed system; data is distributed among all the nodes in a cluster.
17. Components of Cassandra
• Node − The basic fundamental unit of Cassandra; data is stored on these units (a computer/server).
• Data center − A collection of related nodes.
• Cassandra Rack − A rack is a unit that contains multiple servers stacked on top of one another; a node is a single server in a rack.
• Cluster − A cluster is a component that
contains one or more data centers.
18. Components of Cassandra
• Commit log − The commit log is a crash-recovery
mechanism in Cassandra. Every write operation is
written to the commit log.
• Mem-table − A mem-table is a memory-resident
data structure. After commit log, the data will be
written to the mem-table.
• SSTable − It is a disk file to which the data is
flushed from the mem-table when its contents
reach a threshold value.
19. A rack is a group of machines housed in the same physical box. Each machine in the rack has its own CPU, memory, and hard disk; the rack itself has no CPU, memory, or hard disk of its own.
• All machines in the rack are connected to the rack's network switch.
• The rack's network switch is connected to the cluster.
• All machines in the rack share a common power supply. It is important to notice that a rack can therefore fail for two reasons: a network switch failure or a power supply failure.
• If a rack fails, none of the machines on the rack can be accessed, so all the nodes on the rack appear to be down.
21. Cassandra Architecture
• All the nodes in a cluster play the same role. Each node is
independent and at the same time interconnected to other
nodes.
• Each node in a cluster can accept read and write requests,
regardless of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served
from other nodes in the network.
22. Data Replication in Cassandra
• In Cassandra, one or more of the nodes in a
cluster act as replicas for a given piece of data.
• If it is detected that some of the nodes
responded with an out-of-date value,
Cassandra will return the most recent value to
the client. After returning the most recent
value, Cassandra performs a read repair in the
background to update the stale (old) values.
• The replication factor (RF) lies between 1 and n (the number of nodes).
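The read-repair behaviour described above can be sketched in Python (a toy model; the node names, timestamps, and dict-based replica store are illustrative, not Cassandra's actual implementation):

```python
# Simplified read-repair sketch: the coordinator collects (value, timestamp)
# pairs from replicas, returns the newest value, and repairs stale replicas.

def read_with_repair(replicas):
    """replicas: dict mapping node name -> (value, write_timestamp)."""
    # Pick the most recently written value across all replica responses.
    newest_node = max(replicas, key=lambda node: replicas[node][1])
    newest_value, newest_ts = replicas[newest_node]
    # Read repair: push the newest value to any replica holding a stale copy.
    for node, (value, ts) in replicas.items():
        if ts < newest_ts:
            replicas[node] = (newest_value, newest_ts)
    return newest_value

replicas = {"A": ("v2", 200), "B": ("v1", 100), "C": ("v2", 200)}
print(read_with_repair(replicas))  # the client sees the newest value, "v2"
print(replicas["B"])               # the stale replica has been repaired
```

In real Cassandra the repair happens asynchronously in the background after the newest value has already been returned to the client.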
23. Gossip protocol
• Cassandra uses the Gossip Protocol in the
background to allow the nodes to communicate with
each other and detect any faulty nodes in the
cluster.
• A gossip protocol is a style of computer-to-
computer communication protocol inspired by the
form of gossip seen in social networks.
• The term epidemic protocol is sometimes used as a
synonym for a gossip protocol, because gossip
spreads information in a manner similar to the
spread of a virus in a biological community.
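The epidemic spread can be simulated in a few lines of Python (a toy model, not Cassandra's actual gossip implementation; the node names and round structure are illustrative):

```python
import random

# Toy gossip rounds: each node periodically picks a random peer and exchanges
# state with it. Information spreads "epidemically" until every node knows it,
# which is how Cassandra nodes learn about each other and detect failures.

def gossip_until_converged(nodes, rng):
    knows = {n: False for n in nodes}
    knows[nodes[0]] = True  # one node starts with a piece of cluster state
    rounds = 0
    while not all(knows.values()):
        rounds += 1
        for node in nodes:
            peer = rng.choice([n for n in nodes if n != node])
            if knows[node] or knows[peer]:
                # Bidirectional exchange: both ends end up with the state.
                knows[node] = knows[peer] = True
    return rounds

nodes = [f"node{i}" for i in range(8)]
print(gossip_until_converged(nodes, random.Random(42)))  # converges in a few rounds
```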
24. Partitioner
• Used for distributing data on the various nodes in
a cluster.
• It also determines the node on which to place the
very first copy of the data.
• It is a hash function
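The partitioner can be sketched as a hash function that maps a partition key to a token on the ring; the first node clockwise from that token owns the first copy. (This is a toy model: Cassandra actually uses Murmur3 and a far larger token range; md5 and the ring size here are only for a deterministic illustration.)

```python
import hashlib

RING_SIZE = 2 ** 16  # toy token ring; Cassandra's Murmur3 range is much larger

def token_for(partition_key: str) -> int:
    """Hash a partition key to a token on the ring (md5 here for determinism)."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % RING_SIZE

def owner_node(token, node_tokens):
    """First node clockwise whose token is >= the data token (wrapping around)."""
    for node, node_token in sorted(node_tokens.items(), key=lambda kv: kv[1]):
        if token <= node_token:
            return node
    return min(node_tokens, key=lambda n: node_tokens[n])  # wrap to lowest token

node_tokens = {"A": 16000, "B": 32000, "C": 48000, "D": 64000}
t = token_for("student:42")
print(t, owner_node(t, node_tokens))
```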
25. Replication Factor
• The total number of replicas across the cluster is
referred to as the replication factor.
• The RF determines the number of copies of data
(replicas) that will be stored across nodes in a
cluster.
• A replication strategy determines the nodes
where replicas are placed.
– Simple Strategy:
– Network Topology Strategy.
26. Simple Strategy
• Used only for a single data center and one rack.
• SimpleStrategy places the first replica on a node determined by the partitioner; additional replicas are placed on the next nodes clockwise in the ring.
• SimpleStrategy is a rack-unaware and data-center-unaware policy, i.e., it places replicas without considering topology (rack or data center location).
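The clockwise placement rule above can be sketched directly (a simplified model; the ring here is just a list of node names in token order):

```python
# SimpleStrategy sketch: the partitioner picks the node for the first replica;
# the remaining replicas go to the next nodes clockwise around the ring, with
# no awareness of racks or data centers.

def simple_strategy_replicas(ring, first_node, replication_factor):
    """ring: nodes in clockwise token order. Returns the replica nodes."""
    start = ring.index(first_node)
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]

ring = ["A", "B", "C", "D", "E"]
print(simple_strategy_replicas(ring, "D", 3))  # → ['D', 'E', 'A'] (wraps around)
```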
28. Network Topology Strategy
• Network Topology Strategy is used when the cluster spans multiple data centers.
• As the name indicates, this strategy is aware of the network topology (the location of nodes in racks, data centers, etc.) and is more intelligent than SimpleStrategy.
• This strategy specifies how many replicas you want in each data center.
• Replicas are set for each data center separately. Within each data center, replicas are placed clockwise around the ring, on different racks where possible, continuing until the required number of replicas for that data center has been placed.
30. Anti-Entropy
• Anti-entropy is a process of comparing the data of
all replicas and updating each replica to the
newest version.
• Frequent data deletions and node failures are
common causes of data inconsistency.
• Anti-entropy node repairs are important for every
Cassandra cluster.
• Anti-entropy repair is used for routine
maintenance and when a cluster needs fixing.
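Real anti-entropy repair compares Merkle trees of each replica's data; the idea can be sketched with one hash per token range (a simplified, hypothetical model — the range names and row encoding are illustrative):

```python
import hashlib

# Anti-entropy sketch: hash each replica's data per token range, then compare
# hashes replica-to-replica; ranges whose hashes differ need repair.

def range_hashes(data_by_range):
    """data_by_range: dict mapping token range name -> tuple of rows."""
    return {r: hashlib.sha256(repr(rows).encode()).hexdigest()
            for r, rows in data_by_range.items()}

def ranges_needing_repair(replica_a, replica_b):
    ha, hb = range_hashes(replica_a), range_hashes(replica_b)
    return sorted(r for r in ha if ha[r] != hb.get(r))

a = {"r1": ("k1=v1",), "r2": ("k2=v2",)}
b = {"r1": ("k1=v1",), "r2": ("k2=OLD",)}
print(ranges_needing_repair(a, b))  # → ['r2']
```

Comparing hashes instead of raw rows is what keeps repair cheap over the network: only the ranges that actually differ are streamed.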
32. Writes path in Cassandra
• Cassandra processes data at several stages on the write path,
starting with the immediate logging of a write and ending in
compaction:
– Logging data in the commit log
– Writing data to the memtable
– Flushing data from the memtable
– Storing data on disk in SSTables
– Compaction
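The first four stages of the write path can be sketched as follows (a toy single-node model; the threshold, class names, and in-memory "SSTables" are illustrative, and compaction is omitted):

```python
# Write-path sketch: a write is appended to the commit log first (durability),
# then applied to the in-memory memtable; when the memtable reaches its
# threshold it is flushed to an immutable, sorted on-disk SSTable.

class ToyNode:
    def __init__(self, memtable_threshold=3):
        self.commit_log = []      # crash-recovery log of every write
        self.memtable = {}        # in-memory key -> value
        self.sstables = []        # flushed, sorted (key, value) runs
        self.threshold = memtable_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log the write
        self.memtable[key] = value             # 2. update the memtable
        if len(self.memtable) >= self.threshold:
            self.flush()                       # 3. flush when full

    def flush(self):
        # SSTables are written sorted by key and never modified afterwards.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

node = ToyNode()
for k, v in [("b", 1), ("a", 2), ("c", 3), ("d", 4)]:
    node.write(k, v)
print(node.sstables)   # one flushed, sorted SSTable
print(node.memtable)   # {'d': 4} still in memory
```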
38. Hint table (hinted handoff)
• When a replica node (say node C) is down, the coordinator node (say node A) stores a hint containing:
– the location of the node on which the replica is to be placed,
– version metadata,
– the actual data.
• When node C recovers and is functional again, node A reacts to the hint by forwarding the data to node C.
39. Tunable Consistency (T C)
• Consistency refers to how up-to-date and synchronized a
row of Cassandra data is on all of its replicas.
• Tunable consistency = strong consistency + eventual consistency, chosen per operation.
• Strong Consistency:
– Each update propagates to all replica locations, ensuring every server has a copy of the data before it is served to the client.
– This impacts performance.
40. Eventual Consistency
• The client is acknowledged with success as soon as part of the cluster acknowledges the write.
• It is used when application performance matters more than immediate consistency.
41. Read consistency
• Read consistency specifies how many replicas must respond before the result is sent to the client application.
• Consistency levels : next slide
42. Read consistency levels
• ONE − Returns a response from the closest node (replica) holding the data.
• QUORUM − Returns a result from a quorum of replicas, with the most recent timestamp for the data.
• LOCAL_QUORUM − Returns a result from a quorum of replicas in the same data center as the coordinator node, with the most recent timestamp for the data.
• EACH_QUORUM − Returns a result from a quorum of replicas in all data centers, with the most recent timestamp.
• ALL − Provides the highest level of consistency of all levels; it responds to a read request from a client only after all the replica nodes have responded.
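The arithmetic behind QUORUM is a majority of replicas, floor(RF / 2) + 1, and a read is strongly consistent whenever the read and write replica sets must overlap (R + W > RF). A minimal sketch:

```python
# Quorum math sketch: QUORUM = floor(RF / 2) + 1.
# A read is strongly consistent when R + W > RF, because then at least one
# replica in the read set is guaranteed to hold the latest write.

def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

def is_strongly_consistent(r: int, w: int, rf: int) -> bool:
    return r + w > rf

rf = 3
print(quorum(rf))                                          # → 2
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))  # QUORUM + QUORUM → True
print(is_strongly_consistent(1, 1, rf))                    # ONE + ONE → False
```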
43. Write consistency
• Write consistency specifies on how many replicas a write must succeed before an acknowledgement (ACK) is sent to the client application.
• Write consistency levels: next slide
47. CQLSH
• Cassandra provides Cassandra query language
shell (cqlsh) that allows users to communicate with
Cassandra.
• Using cqlsh, you can
• define a schema,
• insert data, and
• execute a query.
48. KEYSPACES (Database [Namespace])
• It is a container to hold application data like RDBMS.
• Used to group column families together.
• A cluster typically has one keyspace per application.
• A keyspace (or key space) in a NoSQL data store is an
object that holds together all column families of a
design.
• It is the outermost grouping of the data in the data
store.
51. To create a keyspace
CREATE KEYSPACE "KeySpaceName"
WITH replication = {'class': 'StrategyName',
'replication_factor': <number_of_replicas>};
For example:
CREATE KEYSPACE student_db
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
52. Details about existing keyspaces
DESCRIBE KEYSPACES;
SELECT * FROM system_schema.keyspaces;
The second statement gives more details (in Cassandra versions before 3.0, the table was system.schema_keyspaces).
54. To create a column family or table by the name
“student_info”.
CREATE TABLE Student_Info ( RollNo int PRIMARY
KEY, StudName text, DateofJoining timestamp,
LastExamPercent double);
57. SELECT
To view the data from the table "student_info":
SELECT * FROM student_info;
SELECT * FROM student_info WHERE rollno IN (1, 2, 3);
58. Index
To create an index on the "studname" column of the "student_info" column family, use the following statement:
CREATE INDEX ON student_info(studname);
SELECT * FROM student_info WHERE StudName='Aviral';
59. Update
To update the value held in the "StudName" column of the "student_info" column family to "Sharad" for the record where the RollNo column has value 3.
Note: An update sets one or more column values for a given row of a Cassandra table. It does not return anything.
• UPDATE student_info SET StudName = 'Sharad' WHERE RollNo = 3;
60. Delete
To delete the column "LastExamPercent" from the "student_info" table for the record where RollNo = 2.
Note: A DELETE statement removes one or more columns from one or more rows of a Cassandra table, or removes entire rows if no columns are specified.
DELETE LastExamPercent FROM student_info WHERE RollNo=2;
61. Collections
• Cassandra provides collection types, used to group and store data together in a column.
• E.g., grouping a user's multiple email addresses.
• The value of each item in a collection is limited to 64 KB.
• Collections can be used when you need to store things like users' phone numbers or email IDs.
62. Collections: Set
• To alter the schema of the table "student_info" to add a column "hobbies":
ALTER TABLE student_info ADD hobbies set<text>;
UPDATE student_info SET hobbies = hobbies + {'Chess', 'Table Tennis'} WHERE RollNo=4;
63. Collections: List
• To alter the schema of the table "student_info" to add a list column "language":
ALTER TABLE student_info ADD language list<text>;
UPDATE student_info SET language = language + ['Hindi', 'English'] WHERE RollNo=1;
64. Collections: Map
• A map relates one item to another with a key-value pair. Using the map type, you can store timestamp-related information in user profiles.
• To alter the "Student_info" table to add a map column "todo":
• ALTER TABLE Student_info ADD todo map<timestamp, text>;
65. Example
UPDATE student_info SET todo = { '2014-09-24': 'Cassandra Session',
'2014-10-02 12:00': 'MongoDB Session' } WHERE rollno = 1;
66. Time To Live(TTL)
• Data in a column, other than a counter column, can
have an optional expiration period called TTL (time to
live).
• The client request may specify a TTL value for the
data. The TTL is specified in seconds.
67. Time To Live(TTL)
• CREATE TABLE userlogin(userid int primary key,
password text);
• INSERT INTO userlogin (userid, password) VALUES
(1,'infy') USING TTL 30;
• select * from userlogin;
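The expiry behaviour of the userlogin example can be sketched in Python (a toy model of TTL semantics; the class, clock values, and purge-on-read behaviour are illustrative, not Cassandra's actual storage mechanics):

```python
import time

# TTL sketch: each value records an expiry time (write time + TTL); a read
# treats the value as absent once that moment has passed.

class TTLStore:
    def __init__(self):
        self._data = {}

    def insert(self, key, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl_seconds)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        if key not in self._data:
            return None
        value, expires_at = self._data[key]
        if now >= expires_at:
            del self._data[key]   # expired data behaves as if deleted
            return None
        return value

store = TTLStore()
store.insert(1, "infy", ttl_seconds=30, now=1000.0)
print(store.get(1, now=1010.0))  # → 'infy' (still live)
print(store.get(1, now=1031.0))  # → None (TTL of 30 s elapsed)
```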
68. Export to CSV
COPY student_info (RollNo, StudName, DateofJoining, LastExamPercent) TO 'd:\student.csv';
69. Import data from a CSV file
CREATE TABLE student_data (id int PRIMARY KEY, fn text, ln text, phone text, city text);
COPY student_data (id, fn, ln, phone, city) FROM 'd:\cassandraData\student.csv';
70. Introduction to MapReduce Programming
(Revisit for details)
• In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks, which are then executed in a distributed fashion on a Hadoop cluster.
• Each task processes a small subset of the data assigned to it; this is how Hadoop distributes the load across the cluster.
• A MapReduce job takes a set of files stored in HDFS (Hadoop Distributed File System) as input.
71. Mapper
• The Map task takes care of loading, parsing,
transforming, and filtering.
• A mapper maps the input key-value pairs into a set of
intermediate key-value pairs.
• Maps are individual tasks that have the responsibility of transforming input records into intermediate key-value pairs. Each map task is broken into the following phases:
• RecordReader
• Mapper/Map
• Combiner
• Partitioner
72. RecordReader
• RecordReader reads the data from an input split, record by record, and converts it into key-value pairs for input to the Mapper class.
74. Maps
• Map is a user-defined function, which takes a series of
key-value pairs and processes each one of them to
generate zero or more key-value pairs.
• Map takes a set of data and converts it into another set
of data. Input and output are key-value pairs.
75. Combiner
• A combiner is a type of local reducer that groups similar data from the map phase into a new set of key-value pairs.
• It is not part of the main MapReduce algorithm; it is optional (and runs on the map side).
• The main function of a combiner is to summarize the map output records that share the same key.
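The map-side summarization can be sketched with a word-count example (a single-process illustration of the idea, not Hadoop's actual Java API):

```python
from collections import Counter

# Combiner sketch: the mapper emits (word, 1) pairs; a combiner pre-aggregates
# pairs that share a key on the map side, shrinking the data shuffled to reducers.

def mapper(line):
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return sorted(combined.items())

pairs = mapper("to be or not to be")
print(pairs)            # six (word, 1) pairs leave the mapper
print(combiner(pairs))  # → [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Only four summarized pairs (instead of six) would be shuffled to the reducers here; on real workloads the savings can be large.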
76. Difference between Combiner and Reducer
• Output generated by combiner is intermediate data and
is passed to the reducer.
• Output of the reducer is passed to the output file on the
disk.
78. Partitioner
• A partitioner partitions the key-value pairs of
intermediate Map-outputs.
• The Partitioner in MapReduce controls the partitioning
of the key of the intermediate mapper output.
• The partition phase takes place after the Map phase and
before the Reduce phase.
• The number of partitions is equal to the number of reducers; the partitioner divides the data according to the number of reducers, so the data in a single partition is processed by a single reducer.
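The partitioning rule can be sketched as hashing the key modulo the number of reducers, which is also how Hadoop's default HashPartitioner behaves (the byte-sum hash below is a stand-in chosen only to keep the example deterministic):

```python
# Partitioner sketch: one partition per reducer; hashing the intermediate key
# modulo the number of reducers sends every pair with the same key to the
# same reducer.

def partition(key: str, num_reducers: int) -> int:
    # A stable hash keeps the example deterministic across runs
    # (Python's built-in hash() of str is salted per process).
    return sum(key.encode()) % num_reducers

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
by_reducer = {}
for key, value in pairs:
    by_reducer.setdefault(partition(key, 2), []).append((key, value))
print(by_reducer)  # both 'apple' pairs land in the same partition
```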
80. Shuffling and Sorting in Hadoop MapReduce
• The process by which the intermediate output
from mappers is transferred to the reducer is called
Shuffling.
• The intermediate key-value pairs generated by the mappers are sorted automatically by key.
82. Reduce
• The primary task of the Reducer is to reduce
a set of intermediate values (the ones that share
a common key) to a smaller set of values.
• The Reducer takes the grouped key-value paired
data as input and runs a Reducer function on each
one of them.
• Here, the data can be aggregated, filtered, and combined in a number of ways, which requires a wide range of processing.
• The output of the reducer is the final output, which is stored in HDFS.
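Putting the phases together, the classic word-count job can be sketched end to end in a single process (an illustration of the data flow, not Hadoop's distributed execution):

```python
from collections import defaultdict

# End-to-end MapReduce sketch: map each line to (word, 1) pairs, shuffle/sort
# groups the pairs by key, and reduce sums each group.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())  # reducers receive keys in sorted order

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped}

lines = ["deer bear river", "car car river", "deer car bear"]
result = reduce_phase(shuffle_and_sort(map_phase(lines)))
print(result)  # → {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In real Hadoop the same three stages run on many machines, with the RecordReader feeding the map phase and a RecordWriter persisting the reduce output to HDFS.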
83. RecordWriter (Output format)
• RecordWriter writes output key-value pairs from the
Reducer phase to output files.
• OutputFormat instances provided by Hadoop are used to write files in HDFS. Thus the final output of the reducer is written to HDFS by an OutputFormat instance using a RecordWriter.