Apache Cassandra training. Overview and Basics


Day one of a four-day Apache Cassandra seminar for a large international telecommunications company. More: http://www.ukrmaks-soft.de/news12-2013.

Published in: Technology
Transcript of "Apache Cassandra training. Overview and Basics"

  1. Apache Cassandra: Overview and Basics
     © Oleg Magazov, omagazov@t-online.de
  2. Learning Targets
     • Big Data introduction
     • Understand the driving forces behind NoSQL development
     • Map known RDBMS concepts to the corresponding NoSQL paradigms
     • Get an overview of the Apache Cassandra™ architecture
     • Get an overview of the Cassandra™ data model
     • Gain first experience with Cassandra™ packaging and the CLI
  3. Agenda
     • Big Data
     • NoSQL. Main Technologies
     • NoSQL. Products
     • Apache Cassandra™ Features
     • Apache Cassandra™ Architecture
     • Apache Cassandra™ Data Modeling
     • Apache Cassandra™ CLI
  4. Big Data
  5. Origin
     • April 1998: John R. Mashey (SGI), Usenix talk “Big Data and the Next Wave of Infrastress”
     • Big Data refers to huge data volumes, a continuously growing number of data sources, the velocity of data generation, data analysis, and the related technology solutions
  6. IDC Analysis
  7. Global Forecast for 2020
     • 40 zettabytes of data on Earth
     • 5,247 GB of data for every man, woman and child on Earth in 2020
  8. Cisco Forecasts
  9. Some Facts...
  10. Big Data Driving Forces
     • Continued growth of Internet usage, social networks, and smartphones
     • Falling costs of the technology for creating, capturing, and storing information
     • Migration from analog TV to digital TV
     • Growth of machine-to-machine communication
  11. Main Producer
     • Machine-generated data is a key factor behind the expansion
     • Growth from 11% of the digital universe in 2005 to more than 40% in 2020
       – Machine logs
       – RFID readers
       – Sensor networks
       – Vehicle GPS traces
       – Retail transactions
  12. The Analysis Gap
     • Currently only 3% of the potentially useful data is tagged, and even less is analyzed
  13. Storage Capacities
     • I/O for HDDs is time-consuming
     • Reading a 1 TB drive at a transfer speed of 300 MB/s (SATA) takes about 1 h
     • SSDs are about 5 times faster on average
     • SSDs are more expensive
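The "about 1 h" figure on the Storage Capacities slide can be checked with a short calculation. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a purely sequential read with no seek overhead; the function name is illustrative.

```python
# Back-of-the-envelope estimate of one full sequential read of a drive.
TB = 10**12  # decimal terabyte (assumption: vendor-style units)
MB = 10**6   # decimal megabyte


def full_read_seconds(capacity_bytes, bytes_per_second):
    """Time to stream the whole device once, ignoring seeks and overhead."""
    return capacity_bytes / bytes_per_second


hdd_seconds = full_read_seconds(1 * TB, 300 * MB)
# Assumption: the slide's "5 times faster on average" figure for SSDs.
ssd_seconds = hdd_seconds / 5

print(f"HDD full read: {hdd_seconds / 60:.0f} min")
print(f"SSD full read: {ssd_seconds / 60:.0f} min")
```

Under these assumptions the HDD read comes out just under an hour, which matches the slide's rounded estimate. Random access patterns would be considerably slower, which is the point of the next slide.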
  14. Random Seeks
     • Seek time is improving more slowly than transfer rate
     • Random seeks are expensive
     • Inherent to most RDBMS
  15. Structure
     • Data is becoming increasingly semi-structured and unstructured
     • Unstructured data is data without a schema
     • Semi-structured data
       – does not conform to relational database structures
       – is self-describing, containing tags or other structural markers
  16. Limitations of RDBMS
     • Up-front schema declaration is needed
     • Referential integrity is necessary
     • Mainly use B-tree indexes
     • Non-linear scaling
     • Are built around OLTP and OLAP approaches
     • Many solutions are really expensive
  17. ACID
     • Atomicity
     • Consistency
     • Isolation
     • Durability
  18. Isolation Levels
     • Read Uncommitted
     • Read Committed
     • Repeatable Read
     • Serializable
  19. ACID in Distributed Systems
     • Two-phase commit (2PC)
     • Two-phase locking (2PL)
  20. Roadmap
     • Parallel processing
     • Sharding and shared-nothing architecture
     • Reliability through replication
     • Advanced algorithms for parallel processing
     • Advanced storage structures addressing the seek problem
  21. NoSQL
  22. CAP Theorem
  23. BASE
     • BASE: Basically Available, Soft state, Eventual consistency
       – R: number of nodes that are read from
       – W: number of nodes that are written to
       – N: total number of nodes in the cluster
       – R + W = 2N: ACID compliant
  24. Sharding
  25. Sharding
     • Feature-based sharding (functional segmentation)
     • Key-based sharding
     • Lookup table
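The three sharding approaches named on the Sharding slide can be contrasted in a few lines. This is a minimal sketch, not taken from the slides: the function and class names, the feature names, and the use of MD5 as the hash are all illustrative assumptions.

```python
import hashlib

# Feature-based sharding: each functional area lives on its own shard.
# The feature names and shard numbers are hypothetical.
FEATURE_SHARDS = {"users": 0, "orders": 1, "messages": 2}


def feature_based_shard(feature):
    """Pick the shard from the functional area the data belongs to."""
    return FEATURE_SHARDS[feature]


def key_based_shard(key, num_shards):
    """Pick the shard by hashing the key (MD5 chosen only for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class LookupTableSharding:
    """Pick the shard from an explicit key-to-shard directory."""

    def __init__(self, default_shard):
        self._table = {}
        self._default = default_shard

    def assign(self, key, shard):
        self._table[key] = shard

    def shard_for(self, key):
        return self._table.get(key, self._default)


directory = LookupTableSharding(default_shard=0)
directory.assign("user:42", 3)
print(feature_based_shard("orders"))
print(key_based_shard("user:42", 4))
print(directory.shard_for("user:42"))
```

A design note on the trade-off: plain modulo hashing remaps most keys when the number of shards changes, whereas a lookup table allows individual keys to be moved at the cost of maintaining the directory.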
  26. Setting Context
     • “The Google File System”, October 2003
     • “MapReduce: Simplified Data Processing on Large Clusters”, December 2004
     • “Bigtable: A Distributed Storage System for Structured Data”, November 2006
     • “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, November 2006
  27. MapReduce
     • Created by Google
     • Parallel processing model
     • Data locality
     • Allows distributed processing of large data sets in a cluster
     • Derives its ideas from functional programming
     • Works with semi-structured data
  28. MapReduce
     • map(key1, value1) -> list<key2, value2>
     • reduce(key2, list<value2>) -> list<value3>
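The two signatures on the MapReduce slide can be illustrated with the classic word count. The sketch below runs in a single process and only mimics the map, group-by-key, and reduce phases; `run_mapreduce`, `map_fn`, and `reduce_fn` are illustrative names, not part of any framework.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """map(key1, value1) -> list<key2, value2>: emit (word, 1) per word."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """reduce(key2, list<value2>) -> list<value3>: sum the counts."""
    return [sum(counts)]


def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for the shuffle phase of a real cluster."""
    groups = defaultdict(list)
    for key1, value1 in inputs:
        for key2, value2 in mapper(key1, value1):
            groups[key2].append(value2)
    return {key2: reducer(key2, values) for key2, values in sorted(groups.items())}


docs = [("d1", "big data big clusters"), ("d2", "data locality")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)
```

In a real deployment the map calls run on the nodes that hold the input blocks (data locality), and the grouping step moves intermediate pairs across the network to the reducers.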
  29. Amazon Dynamo
     • “Dynamo: Amazon’s Highly Available Key-value Store”, October 2007
     • Introduced the notion of eventual consistency
     • There can be short intervals of inconsistency between replicas
     • Eventual consistency does not mean inconsistency
  30. Amazon Dynamo
     • Masterless
     • Physical nodes are peers and are organized into a ring
     • Automatic partitioning mechanism
     • Written in Java
  31. Apache Hadoop
     • 2004: initial versions of the Hadoop Distributed Filesystem and MapReduce implemented
     • January 2006: Doug Cutting joins Yahoo!
     • February 2006: Apache Hadoop project officially started
  32. NoSQL Features
     • Favors horizontal scalability over vertical scalability
     • Promises linear scalability
     • Uses new advanced technologies for parallel processing
     • Often uses a custom file system implementation or advanced storage techniques
     • Optionally schema-free
     • No concept of locking, or locking is optional by design
  33. NoSQL Databases Classification
     • Sorted Ordered Column-Oriented Stores
     • Key/Value Stores
     • Document Databases
     • Graph Databases
  34. Ordered Column-Oriented Stores
     • Store data sets (column families) as sections of columns
       – Set of key (column)/value pairs
       – Sorted by row key (primary key)
     • Units of data are sorted and ordered on the basis of the row key
  35. Column-Oriented Stores
  36. Key/Value Stores
     • Idea: a HashMap with fast O(1) access
     • The key of a key/value pair is a unique value in the set and can be easily looked up to access the data
     • Eventual consistency
  37. Products
  38. Document Databases
     • Keep documents as loosely structured sets of key/value pairs, typically JSON (JavaScript Object Notation)
     • Treat a document as a whole and avoid splitting it into its constituent name/value pairs
     • Allow indexing of documents not only by their primary identifier but also by their properties
  39. Products
  40. Graph Databases
  41. Graph Databases
     • Use graph structures with nodes, edges, and properties to represent and store data
     • Are based on graph theory
     • Are faster for associative data sets
     • Do not require expensive join operations
     • Best suited for graph-like queries
  42. Products
  43. Apache Cassandra™
  44. History
     • Originated at Facebook in 2007 to solve the company’s inbox search problem
     • July 2008: open source Google Code project
     • March 2009: Apache Incubator project
     • February 2010: top-level Apache project
     • November 2013: version 2.0.3 was released
  45. Cassandra Features (Part I)
     • High availability
     • Linear and elastic scalability
     • Distributed and decentralized
     • Peer-to-peer
     • No single point of failure
  46. Cassandra Features (Part II)
     • Fault tolerance and built-in failure detection
     • Tunable consistency
     • Supports a basic subset of SQL via CQL
     • Command-line access to the store
     • Basic security support
  47. Cassandra Features (Part III)
     • Thrift interface and an internal Java API
     • Clients for multiple languages and platforms: Java, Python, Grails, PHP, .NET, Ruby, Scala
     • Support of JMX interfaces
     • Built-in benchmarking
     • Hadoop and MapReduce integration
  48. Cassandra in CAP Triangle
  49. Architecture. Big Picture
  50. Architecture Components. Part I
     • Consistent hashing
     • Virtual nodes
     • Gossip and failure detection
     • Hinted handoff
     • Anti-entropy and read repair
  51. Architecture Components. Part II
     • Ring topology
     • Staged Event-Driven Architecture (SEDA)
     • Compaction
     • Tombstones
     • Memtables, SSTables, and commit logs
  52. Architecture Components. Part III
     • Row and key caches
     • Bloom filters
     • Merkle trees
     • Compression
     • Atomic batches
  53. Tunable Consistency
     • Replication Factor (RF)
     • Quorum
       – R + W > RF
       – Quorum = (RF/2) + 1 (integer division)
     • Consistency for reads and writes is chosen per operation
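The two formulas on the Tunable Consistency slide can be made concrete with a small calculation. This is a minimal sketch; the function names are illustrative, and R and W stand for the number of replicas that must acknowledge a read or a write.

```python
def quorum(replication_factor):
    """Quorum = (RF / 2) + 1, using integer division."""
    return replication_factor // 2 + 1


def read_sees_latest_write(r, w, replication_factor):
    """R + W > RF: every read set overlaps every write set in at least one replica."""
    return r + w > replication_factor


for rf in (1, 3, 5):
    q = quorum(rf)
    print(rf, q, read_sees_latest_write(q, q, rf))
```

With RF = 3, quorum reads and quorum writes (2 + 2 > 3) always overlap, while reading and writing a single replica each (1 + 1 > 3 is false) only gives eventual consistency.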
  54. Replication Strategy
     • SimpleStrategy
     • NetworkTopologyStrategy
     • A keyspace is created with a replica placement strategy
  55. Simple Strategy
     • For single data center clusters
     • The first replica is placed on the node determined by the partitioner
     • Additional replicas are placed on the next nodes clockwise in the ring
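The placement rule on the Simple Strategy slide can be sketched as a walk around a token ring. The code below is an illustration under simplifying assumptions, not Cassandra's implementation: each node owns exactly one token (no virtual nodes), tokens are small integers, the ring is given as a list sorted by token, and the function name is hypothetical.

```python
import bisect


def simple_strategy_replicas(ring, key_token, replication_factor):
    """Return the nodes holding a key: the owner, then the next nodes clockwise.

    ring is a list of (token, node) pairs sorted by token. A node owns the
    range that ends at its token, so the owner is the first node whose token
    is greater than or equal to the key's token, wrapping around at the end.
    """
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


# Hypothetical four-node ring with evenly spaced tokens.
ring = [(0, "node1"), (25, "node2"), (50, "node3"), (75, "node4")]
print(simple_strategy_replicas(ring, 30, 3))
print(simple_strategy_replicas(ring, 80, 2))
```

A key with token 30 falls into node3's range, so with a replication factor of 3 the copies land on node3, node4, and node1. A key with token 80 wraps around to node1.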
  56. Simple Strategy [ring diagram showing replica placement; node labels not recoverable from the transcript]
  57. Data Distribution and Replication
     • How does Cassandra data distribution and replication work?
  58. Consistent Hashing
  59. Client Request Workflow
  60. Network Topology
     • Data center: a grouping of nodes configured together for replication purposes
     • Rack: a similar physical grouping of nodes
     • The snitch maps IPs to racks and data centers
       – All nodes in a cluster must use the same snitch configuration
  61. Cassandra Client API
     • Cassandra CLI, Thrift-based
     • CQL3, native protocol
     • cqlsh, with a Python dependency
     • Drivers for multiple languages
     • Java: CQL3 via the DataStax 1.0 driver
  62. DataStax Java Driver
     • Works only with CQL3
     • Layered architecture
     • Relies on Netty for non-blocking I/O, providing a fully asynchronous architecture
     • Connection pooling, node discovery
     • Automatic failover, load balancing
     • Prepared statements are supported
  63. Some Services
     • Daemon
     • Storage
     • Gossip
     • Messaging
     • Load Balancing
  64. Data Model
     • RDBMS vs. Cassandra terminology
  65. RDBMS View
  66. Cassandra View
  67. Cassandra vs. RDBMS (Part I)
     • No referential integrity
     • Doesn’t support joins
     • Limited SQL support
     • Denormalization
  68. Cassandra vs. RDBMS (Part II)
     • Storing collections in a field is possible
     • Row size is a design issue
     • Comparators for column families
     • Ordering is a design issue
  69. Cassandra View
  70. Keyspaces
     • Replication factor
     • Replica placement strategy
     • Column families
     • Usually one keyspace per application
  71. Column Families
     • Serve as containers for an ordered collection of columns/rows
     • Are not equal to RDBMS tables
     • Column families have to be defined up front; individual columns do not
     • Entries in column families are grouped by row key
     • All data for a single row must fit on a single machine in the cluster
  72. Column Families
  73. Static Column Families
     • Use a relatively static set of column names
     • Are more similar to a relational database table
     • Have metadata definitions for individual columns
  74. Dynamic Column Families
     • Allow result sets to be precomputed and stored in a single row for efficient data retrieval
     • Define the type information for column names and values (comparators and validators)
     • Actual column names and values are set by the application when a column is inserted
  75. Column
     • Row keys and column names can be any kind of byte array
     • Useful data can be stored in the key itself, not only in the value
     • Up to 2 billion columns per (physical) row
  76. Legacy: Super Columns
  77. Composite Columns
     • Are used under the hood to store clustered rows
     • All the logical rows with the same partition key get stored as a single, physical wide row
     • Can be created and queried using CQL 3
     • Support range queries
     • Replace super columns
  78. Skinny Rows
     • Are like traditional RDBMS rows
     • Each row contains similar sets of column names
     • But all columns are optional
  79. Wide Rows
     • Have many (potentially millions of) columns
     • Typically contain automatically generated names (like UUIDs or timestamps)
     • Are used to store lists of things
     • All the logical rows with the same partition key get stored as a single, physical row
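The difference between skinny and wide rows may be easier to see in a toy model. The sketch below is an in-memory illustration only, assuming a column family can be pictured as a map from row key to columns kept sorted by name; the `ColumnFamily` class, its methods, and the sensor data are hypothetical and unrelated to Cassandra's storage engine.

```python
import bisect


class ColumnFamily:
    """Toy model: row key -> columns kept sorted by column name."""

    def __init__(self, name):
        self.name = name
        self._rows = {}  # row key -> (sorted column names, name -> value)

    def set(self, row_key, column, value):
        names, values = self._rows.setdefault(row_key, ([], {}))
        if column not in values:
            bisect.insort(names, column)  # keep names ordered, like a comparator
        values[column] = value

    def get(self, row_key):
        names, values = self._rows[row_key]
        return [(name, values[name]) for name in names]

    def slice(self, row_key, start, finish):
        """Range query over column names within a single row."""
        names, values = self._rows[row_key]
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, finish)
        return [(name, values[name]) for name in names[lo:hi]]


# Skinny row: a small, table-like set of column names.
cars = ColumnFamily("Cars")
cars.set("Cabrio", "model", "640i")
cars.set("Cabrio", "make", "bmw")

# Wide row: generated column names (here timestamps) forming a list of things.
readings = ColumnFamily("Readings")
readings.set("sensor-1", "2013-11-02T08:00", "21.5")
readings.set("sensor-1", "2013-11-01T09:30", "20.1")
readings.set("sensor-1", "2013-11-01T12:00", "22.4")

print(cars.get("Cabrio"))
print(readings.slice("sensor-1", "2013-11-01T00:00", "2013-11-01T23:59"))
```

Because the columns of a row stay ordered by name, a range of timestamps can be read as one contiguous slice, which is why ordering is called a design issue on the earlier comparison slide.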
  80. Practice Drive
  81. Download and Install
     • Cassandra requires at least the Java 1.7 JDK (http://www.oracle.com/technetwork/java/javase/downloads/index.html)
     • Download from http://cassandra.apache.org/download/
     • Extract the archive into a directory
     • Customize cassandra.yaml in the conf directory
     • Start with bin/cassandra -f
  82. Create Schema
     • cassandra-cli -host localhost -port 9160
     • create keyspace TestsDataStore;
     • show keyspaces;
     • use TestsDataStore;
     • create column family Cars with comparator = UTF8Type;
     • update column family Cars with column_metadata = [{column_name: make, validation_class: UTF8Type}, {column_name: model, validation_class: UTF8Type}];
  83. Populate With Data
     • assume Cars keys as utf8;
     • set Cars['Cabrio']['make'] = 'bmw';
     • set Cars['Cabrio']['model'] = '640i';
     • set Cars['Corolla']['make'] = 'toyota';
     • set Cars['Corolla']['model'] = 'le';
     • set Cars['fit']['make'] = 'honda';
     • set Cars['fit']['model'] = 'fit sport';
     • set Cars['focus']['make'] = 'ford';
     • set Cars['focus']['model'] = 'sel';
  84. Data Manipulation
     • get Cars['Cabrio'];
     • get Cars['Cabrio']['make'];
     • update column family Cars with comparator=UTF8Type and column_metadata=[{column_name: make, validation_class: UTF8Type, index_type: KEYS}, {column_name: model, validation_class: UTF8Type}];
     • del Cars['Cabrio']['make'];
     • drop column family Cars;
     • drop keyspace TestsDataStore;
     • show keyspaces;
  85. Agile Development with Cassandra
     • Facilitates agile development by providing a schema-free data model and a query-first paradigm
     • Makes TDD easier by providing built-in test tools
     • Is built around multiple design patterns, facilitating a Clean Code approach
     • Decentralized nature makes distributed work easier (including geographical distribution)
  86. Use Cases
     • Large deployments
     • Lots of writes, statistics, and analysis
     • Geographical distribution
     • Very large data volumes
     • High reliability requirements for data storage
  87. Some Users
