1. MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF SCIENCE, HO CHI MINH CITY
FACULTY OF INFORMATION TECHNOLOGY
NoSQL
Column-Family Stores
Supervisor: Dr. Nguyễn Trần Minh Thư
Group 07:
1. 19C11015 - Đỗ Huy Gia Cát
2. 21C12003 - Đào Thanh Danh
3. 21C11026 - Nguyễn Thành Thái
Report for the course Advanced Database Systems
NoSQL - Not Only SQL
2. CONTENTS
• Column-Family Stores NoSQL
o Overview
o Column-Family Databases
• Cassandra's Structure and Features
• Comparing column-family data stores with other stores
• Query features
• Extended analysis
• Scaling
• Comparing Cassandra and HBase
• Suitable use cases
4. Wide Column / Column Family Database
• Column-family stores are databases in which data is stored as key-value
mappings, with the values grouped into multiple column families, each of
which is a map of data
• Terminology comparison between RDBMS and Cassandra
RDBMS | Cassandra
Database instance | Cluster
Database | Keyspace
Table | Column Family
Row | Row
Column (same for all rows) | Column (can be different per row)
6. Column Family Storage Units
• Column: the basic storage unit, consisting of a name-value pair in which
the name also acts as the key, stored together with a timestamp
• Super column: a column whose value is a map of columns
7. Column Family Storage Units
• Standard column family: a column family in which all columns are
simple columns
• Super column family: a column family that contains at least one
super column
9. Consistency
• Cassandra stores replicas on multiple nodes to ensure reliability
• Cassandra provides tunable consistency levels, including these three:
ONE: only one of the replica nodes needs to respond to the request; good
for high write performance requirements
QUORUM: a majority of the replica nodes must respond to the request
ALL: all replica nodes must respond to the request
• If a node is down, the writes it missed are delivered later, when it comes
back, via hints (hinted handoff) or the repair command.
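The three levels above can be read as "how many replicas must acknowledge a request". The following is a minimal sketch of that mapping, not Cassandra's own code; the function name `required_acks` and its parameters are hypothetical and exist only for this illustration.

```python
def required_acks(level, replication_factor):
    """Return how many replicas must respond at the given consistency level.

    Hypothetical helper for illustration; not part of any Cassandra driver.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas: floor(RF / 2) + 1.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: " + level)


# With three replicas: ONE needs 1, QUORUM needs 2, ALL needs 3.
for name in ("ONE", "QUORUM", "ALL"):
    print(name, required_acks(name, 3))
```

With three replicas, QUORUM still succeeds when one replica is down, whereas ALL does not.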
10. Transactions
• Cassandra does not have transactions in the traditional sense; writes are
atomic and isolated at the row level
• Atomic: inserting or updating columns in a row is treated as a single
write operation
• Isolation: writes to a row are isolated to the client performing them and
are not visible to other users until they complete
11. Availability
• The trade-off between availability and consistency is governed by the
formula
(R + W) > N
R, W: the minimum number of nodes that must respond successfully to a
read / write request; N: the number of replicas of the data
• Keyspaces should be set up according to your needs: higher
availability for reads or for writes
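To make the formula concrete, here is a small sketch that checks the condition and shows what it costs in availability. The function names are hypothetical and the numbers are only illustrative; the reasoning is that when R + W > N, any set of R replicas must share at least one node with any set of W replicas.

```python
def reads_overlap_writes(r, w, n):
    """True when R + W > N, i.e. every read set intersects every write set."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(required, n):
    """How many of the n replicas may be down while the request still succeeds."""
    return n - required


N = 3
for r, w in [(1, 1), (2, 2), (1, 3)]:
    print(
        f"R={r} W={w}: overlap={reads_overlap_writes(r, w, N)}, "
        f"reads survive {tolerated_failures(r, N)} failures, "
        f"writes survive {tolerated_failures(w, N)} failures"
    )
```

Lowering W favours write availability and lowering R favours read availability; pushing both down at once gives up the overlap guarantee.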
12. Scaling
• Cassandra handles scaling by adding additional nodes to the cluster
• Clusters can be scaled on the fly, without interrupting operations =>
maximum uptime
13. Cassandra and HBase
Aspect | Description
Database | Open-source NoSQL, column-family; stores non-relational data in a column-family model
Scalability | Scales by adding nodes
Replication | Replicates data on multiple nodes
14. Cassandra and HBase
Aspect | Cassandra | HBase
Infrastructure | Independent design; can integrate with other DBMSs and with Storm and Hadoop | Based on Hadoop; can integrate with ZooKeeper; HBase master, HBase data node, name node
Ordered partitioning | Supported | Not supported
Node | Multiple seed nodes in a cluster | A master node monitors and coordinates the other nodes
Query language | Cassandra Query Language (CQL), Cassandra Query Language Shell (CQLSH) | HBase shell only
20. Cassandra Query Language
• Built-In Data Type: boolean, int, bigint, varint, float, double, decimal,
ascii, varchar, text, timestamp, blob, inet, timeuuid, uuid,…
• Collection Data Type: LIST, SET, MAP
• User-Defined Data Type
21. Cassandra Query Language
• User-Defined Data Type
CREATE TYPE <keyspace>.<type name>
(<field name> <data type>, <field name> <data type>);
CREATE TYPE records (
name text,
branch text,
phone int,
city text,
id set<int>
);
22. Cassandra Query Language
SELECT Clause, WHERE Clause & ORDER BY
INSERT INTO <table name>
(<field name 1>, <field name 2>, <field name 3>, ...)
VALUES ('value 1', 'value 2', 'value 3', ...)
USING <update parameter>;
UPDATE <table name> USING <update parameter>
SET <field name 1>=<value 1>,
<field name 2>=<value 2>,
<field name 3>=<value 3>, ...
WHERE <field>=<value>;
23. Cassandra Query Language
DELETE FROM <table name>
USING <update parameter>
WHERE <identifier>=<value>;
BEGIN BATCH
//different data manipulation command syntax -> INSERT, UPDATE
// DELETE
APPLY BATCH;
24. Cassandra Query Language
• Advanced Queries and Indexing
• CREATE INDEX <index name> ON <table name> (<field name>);
Indexes are implemented as bit-mapped indexes and perform well for
low-cardinality column values.
• USE, CREATE, ALTER, DROP, TRUNCATE,...
27. Suitable Use Cases
• A great choice to store event information, such as application state or errors
encountered by the application
• Content Management Systems, Blogging Platforms
=> store blog entries with tags, categories, links, and trackbacks
• Count and categorize visitors of a page to calculate analytics
• Data that is only needed for a specific time, such as ad banners on a website
28. When Not to Use
• Systems that require ACID transactions for writes and reads
• Workloads that need the database to aggregate data using queries (such
as SUM or AVG)
• Early product prototypes or initial tech spikes
Google's Bigtable, first released in 2005, laid the foundation.
Apache Cassandra® is an open-source database. Developed at Facebook, it was open-sourced in 2008, and development has continued under Apache since 2009.
HBase is modeled after Google's Bigtable and written in Java.
Column-family stores allow you to store data with keys mapped to values and the values grouped into multiple column families.
Each column is a tuple (triplet) consisting of a column name, a value, and a timestamp. In a relational database table, this data would be grouped together within a table with other non-related data.
A combination of the key-value model and the table model.
With the Consistent Hashing algorithm, each node is assigned a token, and data is distributed to the nodes based on these tokens.
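The token-based placement can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Cassandra's implementation: MD5 truncated to 32 bits stands in for the real partitioner hash, each node has a single token, and the names `token_for`, `TokenRing`, and the node labels are hypothetical.

```python
import bisect
import hashlib


def token_for(key):
    """Hash a key to a token in [0, 2**32). MD5 is only a stand-in hash here."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class TokenRing:
    """A ring of (token, node) pairs; each node owns the range ending at its token."""

    def __init__(self, node_tokens):
        if not node_tokens:
            raise ValueError("the ring needs at least one node")
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def owner_of_token(self, token):
        # First node whose token is >= the given token, wrapping past the end.
        index = bisect.bisect_left(self._tokens, token) % len(self._ring)
        return self._ring[index][1]

    def owner(self, key):
        return self.owner_of_token(token_for(key))


ring = TokenRing(
    {"node-a": 1_000_000_000, "node-b": 2_000_000_000, "node-c": 3_000_000_000}
)
print(ring.owner("user:42"))  # one of node-a, node-b, node-c
```

Because only the token ranges next to a new node change, adding a node moves a fraction of the keys rather than reshuffling all of them.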
The Gossip Protocol is the communication protocol between the nodes in a cluster; it is essentially P2P communication.
In the peer-to-peer architecture, all server nodes in the system play the same role and there is no master. This avoids the weakness of traditional master-slave designs, where a master crash brings down the whole system.
SimpleStrategy -> replicas are placed in clockwise direction on the node ring.
NetworkTopologyStrategy -> replicates across multiple data centers in a cluster.
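The clockwise placement of SimpleStrategy can be sketched as follows. This is an illustrative model only: the function name, the node labels, and the token values are hypothetical, each node has exactly one token, and racks and data centers (which NetworkTopologyStrategy takes into account) are ignored.

```python
import bisect


def simple_strategy_replicas(node_tokens, key_token, replication_factor):
    """Pick the node owning key_token, then the next nodes clockwise on the ring."""
    ring = sorted((token, node) for node, token in node_tokens.items())
    if not 1 <= replication_factor <= len(ring):
        raise ValueError("replication_factor must be between 1 and the node count")
    tokens = [token for token, _ in ring]
    # The first replica is the first node whose token is >= the key's token.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    # The remaining replicas are the following nodes, wrapping around the ring.
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


nodes = {"a": 100, "b": 200, "c": 300, "d": 400}
print(simple_strategy_replicas(nodes, key_token=250, replication_factor=3))
```

For the ring above, a key with token 250 lands on node c first and is then copied to d and a, in ring order.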
blob: byte data
ascii: ASCII string; text: UTF-8 string
bigint: 64-bit signed integer, -2^63 to 2^63 - 1
counter: 64-bit integer; counter columns cannot be written with INSERT, only with UPDATE, which can increment or decrement the value; they cannot be indexed
inet: represents an IPv4 or IPv6 address string
timestamp: yyyy-mm-dd HH:mm or yyyy-mm-dd HH:mm:ss
SET: elements are returned in sorted order
USING <update parameter> is optional, e.g. USING TTL 86400;
INSERT and UPDATE are keyed on the row key, which is similar to a primary key. Inserting a duplicate key does not raise an error; instead, versions are distinguished by timestamp, and the version with the most recent timestamp wins.
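The "no duplicate-key error, newest timestamp wins" behaviour can be modelled with a tiny in-memory row. This is a sketch under stated assumptions, not Cassandra's storage engine: the `Row` class and the sample values are hypothetical, timestamps are plain integers, and ties between equal timestamps are resolved in a simplified way (the later call wins).

```python
class Row:
    """One row: a map of column name -> (value, timestamp)."""

    def __init__(self):
        self._columns = {}

    def write(self, name, value, timestamp):
        """INSERT and UPDATE behave the same here: both are upserts."""
        current = self._columns.get(name)
        # Keep the incoming value only if it is at least as new as the stored one.
        if current is None or timestamp >= current[1]:
            self._columns[name] = (value, timestamp)

    def read(self, name):
        """Return the current value of a column, or None if it was never written."""
        entry = self._columns.get(name)
        return None if entry is None else entry[0]


row = Row()
row.write("city", "Hanoi", timestamp=10)
row.write("city", "Hue", timestamp=20)    # newer write replaces the old value
row.write("city", "Dalat", timestamp=15)  # older write arrives late and is ignored
print(row.read("city"))
```

Writing the same key twice never fails; the late-arriving write with timestamp 15 simply loses to the one with timestamp 20.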
Thousands of companies have adopted it, including Apple, Instagram, Uber, Spotify, Twitter, Cisco, Rackspace, eBay, and Netflix.
Comments can be either stored in the same row or moved to a different keyspace;
similarly, blog users and the actual blogs can be put into different column families.
The cost of a query change may be higher than the cost of a schema change.
Cassandra offers little support for computation at the storage layer; it does not support functions such as sum, group, join, max, and min.