2. Agenda
• What is Cassandra
• Main features and known issues
• Demo : Use Cassandra for OLAP
3. What is Cassandra
• Apache Cassandra is:
• Free to download and install
• Open source, still actively developed on GitHub and JIRA
• A NoSQL database management system
• Designed to be distributed:
Handles large amounts of data
Across many commodity servers
Provides high availability with no single point of failure
4. Cassandra Query Language (CQL)
• CQL is
• a simple interface for accessing Cassandra
• an alternative to the traditional Structured Query Language (SQL)
• CQL provides native syntax for collections and other common
encodings. Language drivers are available for Java (JDBC), Python
(DBAPI2), Node.JS (Helenus), Go (gocql) and C++.
5. Something special
• Scalability
• MapReduce support
• Distributed
• Supports replication and multi data center replication
• Fault-tolerant
• Tunable consistency
8. Distributed : How to store data
• Key features of Cassandra’s distributed architecture are specifically tailored
for multiple-data center deployment.
• Cassandra operates by dividing all data evenly around a cluster of nodes,
which can be visualized as a ring. Nodes generally run on commodity
hardware. Each Cassandra node in the cluster is responsible for and
assigned a token range (which is essentially a range of hashes defined by a
partitioner).
• Each update or addition of data contains a unique row key (also known as
a primary key). The primary key is hashed to determine the replica (or node)
responsible for the token range that includes the given row key. The data is then
stored in the cluster n times (where n is defined by the
keyspace’s replication factor), that is, once on each replica responsible for the given
row key.
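The placement rule above can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's implementation: the `Ring` class, the node names, and the use of MD5 as the hash are assumptions made for the example, and replicas are chosen by simply walking clockwise around the ring, loosely in the spirit of SimpleStrategy.

```python
import bisect
import hashlib


def token(row_key: str) -> int:
    """Hash a row key to a token (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


class Ring:
    """Toy token ring: each node owns the token range ending at its own token."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed the node count")
        self.replication_factor = replication_factor
        # Place every node on the ring at the hash of its (hypothetical) name.
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key):
        """Return the nodes that store this row key."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.entries)
        # Walk clockwise to pick n distinct replicas.
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("Disney"))  # three distinct node names
```

Because the replica set is computed from the key alone, any node can work out where a row lives without asking the rest of the cluster.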
9. Distributed : How to read / write data
• A read request is processed with tunable (eventual) consistency. For example, if a read uses
QUORUM consistency and the keyspace was created with a “replication factor” of 3, 2 of the 3 replicas
for the requested data are contacted, their results merged, and a single result returned to the
client.
• For a write request, the coordinator node sends the write, with all
mutated columns, to all replica nodes for the given row key. On each replica, the write is:
• First added to the commit log, which ensures durability of the transaction.
• Next, added to the memtable. A memtable is a bounded in-memory write-back
cache that contains recent writes which have not yet been flushed to an SSTable (a
permanent, immutable, serialized on-disk copy of the table's data).
• When updates cause a memtable to reach its configured maximum in-memory size, the
memtable is flushed to an immutable SSTable, persisting the data from the memtable
permanently on disk while making room for future updates.
• In the event of a crash or node failure, events are replayed from the commit log, which
prevents the loss of any data from memtables that had not been flushed to disk before an
unexpected event such as a power outage or crash.
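The write path on a single replica can be sketched as follows. This is a toy model under stated assumptions: the `Node` class is hypothetical, everything lives in Python lists and dicts rather than on disk, and the memtable limit counts rows instead of bytes.

```python
class Node:
    """Toy replica: commit log -> memtable -> SSTable (all in memory here)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # assumption: counted in rows, not bytes
        self.commit_log = []  # stands in for the durable, append-only log
        self.memtable = {}    # recent writes, lost on a crash
        self.sstables = []    # flushed snapshots, never modified afterwards

    def write(self, row_key, columns):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((row_key, dict(columns)))
        # 2. Apply the write to the memtable.
        self._apply(row_key, columns)
        # 3. Flush once the memtable reaches its configured limit.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def _apply(self, row_key, columns):
        self.memtable.setdefault(row_key, {}).update(columns)

    def flush(self):
        # Persist a copy of the memtable as a new SSTable, then make room.
        self.sstables.append({k: dict(v) for k, v in self.memtable.items()})
        self.memtable = {}
        # Flushed writes no longer need replaying, so the log can be discarded.
        self.commit_log = []

    def crash_and_recover(self):
        # Memory is lost; unflushed writes are rebuilt from the commit log.
        self.memtable = {}
        for row_key, columns in self.commit_log:
            self._apply(row_key, columns)


node = Node(memtable_limit=2)
node.write("k1", {"a": 1})
node.write("k2", {"b": 2})  # reaches the limit and triggers a flush
node.write("k3", {"c": 3})  # only in the commit log and memtable
node.crash_and_recover()
print(node.memtable)        # the unflushed write to k3 is back
```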
10. Something tricky
• Cassandra is not row-level consistent:
• When inserts and updates to a table
o affect the same row and are processed at approximately the same time,
o they may affect the non-key columns in inconsistent ways
o One update may affect one column while another affects the other,
o resulting in sets of values within the row that were never specified or intended
o On update, Cassandra does not check whether the data conflicts!
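A minimal sketch of how such a mixed row can arise, assuming conflicts are resolved per column by write timestamp (last write wins). The `merge_row` helper and the sample values are hypothetical, and real tie-breaking on equal timestamps is not modeled.

```python
def merge_row(stored, update):
    """Merge an update into a row column by column.

    Each column holds a (value, timestamp) cell; the newer cell wins.
    On equal timestamps this sketch keeps the stored cell.
    """
    merged = dict(stored)
    for column, (value, timestamp) in update.items():
        if column not in merged or timestamp > merged[column][1]:
            merged[column] = (value, timestamp)
    return merged


# Writer A sets both columns at timestamp 10.
update_a = {"city": ("Paris", 10), "country": ("France", 10)}
# Writer B, at almost the same time, sets only one column at timestamp 11.
update_b = {"city": ("Rome", 11)}

row = merge_row(merge_row({}, update_a), update_b)
print(row)  # city comes from B, country from A: a pair neither writer intended
```

The result does not depend on arrival order, so every replica ends up with the same row, but nothing checks that the combined row makes sense.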
11. Data model
• The most important thing to know in Cassandra data modeling: the primary key
• The simplest form:
• The first element in our PRIMARY KEY is what we call the partition key.
• The partition key has a special use in Apache Cassandra beyond establishing the uniqueness of the
record in the database. Its other purpose, and one that is critical in distributed systems, is
determining data locality.
• With more elements added:
• All columns listed after the partition key are called clustering columns.
• This is where Cassandra takes a huge break from relational databases. Where the partition key is
important for data locality, the clustering columns specify the order in which the data is arranged
inside the partition. The way we read this is left to right:
• Item one is the partition key.
• Item two is the first clustering column.
• Item three is the second clustering column.
• After inserting data, you should expect your SELECT to return data in
ascending/descending order of item two for a single partition.
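The roles of the partition key and the clustering columns can be modeled with a small in-memory sketch. The `Table` class is hypothetical, and the airline and airport codes are illustrative values: rows are grouped by partition key, and each partition is kept sorted by its clustering columns, read left to right.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy table: partition key -> rows kept sorted by clustering columns."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering, values):
        rows = self.partitions[partition_key]
        keys = [c for c, _ in rows]
        # Tuples compare left to right, like clustering columns.
        i = bisect.bisect_left(keys, clustering)
        if i < len(rows) and rows[i][0] == clustering:
            rows[i] = (clustering, values)  # same full primary key: overwrite
        else:
            rows.insert(i, (clustering, values))

    def select(self, partition_key):
        """Return one partition's rows, already in clustering order."""
        return list(self.partitions.get(partition_key, []))


flights = Table()
# Item one is the partition key; items two and three are clustering columns.
flights.insert("AA", ("LAX", "SFO"), {"delay": 5})
flights.insert("AA", ("JFK", "LAX"), {"delay": 12})
flights.insert("AA", ("LAX", "JFK"), {"delay": 0})
flights.insert("DL", ("ATL", "JFK"), {"delay": 3})

for clustering, values in flights.select("AA"):
    print(clustering, values)  # ordered by item two, then item three
```

Reading one partition touches one sorted list, which is why queries that name the partition key are cheap and queries that do not must scan.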
12. Demo for flight “delay”
• Maybe we have all experienced being late for a flight, or running
like crazy through the airport to catch a connecting flight because the previous
one was delayed.
• Did you ever notice that sometimes your flight even leaves earlier than
scheduled? How often does this happen?
• While booking tickets, how could I know that an airline is “always late”, or that a transfer airport
is always crowded, so that I can still take a walk even if the first flight is one hour
later than scheduled?
• If we know where and how to look at the data, can we avoid a
problem that is very likely to happen?
13. Dataset
• Source: Kaggle dataset flight-delay
• flights.csv (USA, 2015, all the unscheduled flights); airlines.csv; airports.csv
14. CQL
• Uses primary key / clustering key
• No joins
• ALLOW FILTERING
• Gives you more control:
• User-defined functions (UDF)
• User-defined aggregate functions (UDA)
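A user-defined aggregate is essentially a fold: a state function is applied to each value, starting from an initial state, and a final function turns the last state into the result. The Python sketch below mirrors that shape for an average delay. The function names are hypothetical, and they only correspond in role to the SFUNC, INITCOND and FINALFUNC parts of a CQL aggregate definition.

```python
from functools import reduce


def avg_state(state, value):
    """State function: fold one value into the (count, total) state."""
    count, total = state
    if value is None:  # skip missing delays
        return state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn the last state into the aggregate result."""
    count, total = state
    return total / count if count else None


def run_aggregate(values, sfunc, initcond, finalfunc):
    """Apply sfunc to every value starting from initcond, then finalfunc."""
    return finalfunc(reduce(sfunc, values, initcond))


delays = [10, None, 20, 30]  # illustrative delay values in minutes
print(run_aggregate(delays, avg_state, (0, 0), avg_final))  # → 20.0
```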
15. Start Cassandra (Mac OS) and import the CSV data
• Start Cassandra first in a terminal: /usr/local/apache-cassandra-3.10/bin/cassandra -f
• Then start cqlsh in another terminal tab: /usr/local/apache-cassandra-3.10/bin/cqlsh
• Time for fun in cqlsh:
-- Single-node keyspace for the demo
CREATE KEYSPACE flight WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE flight;
CREATE TABLE flight (
    YEAR SMALLINT,
    MONTH SMALLINT,
    ….
    WEATHER_DELAY TEXT,
    -- AIRLINE is the partition key; the two airports are clustering columns
    PRIMARY KEY (AIRLINE, destination_airport, origin_airport)
);
-- Load the CSV, skipping the header row
COPY flight (YEAR, MONTH, DAY, ….., WEATHER_DELAY)
FROM '/Users/nanazhu/Downloads/flights.csv'
WITH HEADER = true AND NULL = 'NULL';
16. Query: I want to fly from JFK to LAX. Which
airline/what time should be double-checked?
Request
Development of a small project.
Students are strongly encouraged to propose their own idea for projects. As a suggestion, they can refer to (and also select from) the following list of tools.
The project connected to a tool consists, for example,
in studying the logical data model(s) adopted by the tool,
the native storage data structure it uses,
the query language it provides,
and highlighting further distinguishing features.
Also, a demonstration of the basic use of the tool through one or more examples is required.
Presentation connected to projects (possibly through slides) should last around 20 minutes (including the demo).
Why do we need hashing?
Suppose you are searching for data that might contain “Disney” and you don’t know which node(s) hold it (imagine having to turn over every stone to find it).
Do you really think it is a good idea to ask every node “do you have the data ‘Disney’?”
Solution: hash the primary key and go directly to the node that holds the data, then do the rest of the operation there!