BigData Developers MeetUp

Evaluating Apache Cassandra as a Cloud Database
Christian Johannsen, Solutions Engineer

About me
2
Christian Johannsen
Solutions Engineer @ DataStax
Twitter: @cjohannsen81

What is a cloud database not?
3
• A Cloud database is not simply taking a traditional RDBMS and
running it in a Cloud provider’s environment

What is a cloud database?
4
• A cloud database is a database optimised to handle the
infrastructure limitations and features of running on
shared, geographically dispersed infrastructure.
• Some of them are: Outages, noisy neighbours, security,
variable networks etc…

Key attributes of a cloud database
5
• Transparent Elasticity
• being able to add and subtract nodes (physical or virtual machines) with
load balancing when the underlying application and business demands it.
• Transparent Scalability
• addition of nodes increases both (1) performance throughput; (2) ability to
handle Big Data and maintain high performance
• High Availability
• always up; no single point of failure
• Easy Data Distribution
• able to span multiple geographies, data centers, and cloud provider
zones. Can read/write to any node

6
Key attributes of a cloud database
• Support for multiple data types
• Able to handle structured, semi-structured, and unstructured data.
• Easy to manage
• easy to administer a logical database across many nodes
• Lower Cost
• It shouldn’t break the bank and you shouldn’t have to have a massive
upfront cost
• Multiple Infrastructure Support
• It should support being deployed on multiple cloud providers (public and
private) as well as traditional infrastructure
• Security
• Data should be secure in flight, at rest and audited.

Key Attributes at a glance
7
1. Transparent Elasticity
2. Transparent Scalability
3. High Availability
4. Easy Data Distribution
5. Data Redundancy
6. Support for multiple data types
7. Easy to manage
8. Lower Cost
9. Multiple Infrastructure Support
10. Security

Apache Cassandra
9
• Cassandra is a massively scalable, open source, NoSQL, distributed
database built for modern, mission-critical online applications.
• Written in Java and is a hybrid of Amazon Dynamo and Google BigTable
• Masterless with no single point of failure
• Distributed, data centre and rack aware
• 100% uptime
• Predictable scaling Dynamo
BigTable
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Apache Cassandra Overview
• Cassandra was designed with the understanding that system/hardware failures
10
can and do occur
• Peer-to-peer, distributed system
• All nodes the same - masterless with no single point of failure
• Data partitioned among all nodes in the cluster
• Custom data replication to ensure fault tolerance
• Tunable data consistency
• Data Center and Rack aware
• Familiar SQL-Like language – CQL
• More capacity? Add a server!
• More throughput? Add a server!

Cassandra - Locally Distributed
12
• Replication factor (RF): How many copies of your
data?
• RF = 3 in this example
• Each node is storing 60% of the clusters total data
i.e. 3/5
Handy Calculator: http://www.ecyrd.com/cassandracalculator/
Node 1
1st copy
Node 4
Node 5
Node 2
2nd copy
Node 3
3rd copy
• Client reads or writes to any node
• Node coordinates with others
• Data read or replicated in parallel
Node 2
2nd copy

Cassandra - Rack/Zone aware
Rack 1
13
Node 1
1st copy
Rack 2 Rack 2
Node 4
Node 2
Node 3
2nd copy
Rack 3
Rack 1
Node 5
3rd copy
• Cassandra is aware of which rack or
zone each nod resides in
• It will attempt to place each data copy
in a different rack
• RF=3 in this example

Cassandra - DC/Region aware
14
• Active Everywhere – reads/writes in multiple data centres
• Client writes local
• Data syncs across WAN
• Replication Factor per DC
• Different number of nodes per
data center
DC: USA DC: EUROPE
Node 1
1st copy
Node 4
Node 5
Node 2
2nd copy
Node 3
3rd copy
Node 1
1st copy
Node 4
Node 5
Node 2
2nd copy
Node 3
3rd copy

Cassandra - Tuneable Consistency
15
• Consistency Level (CL)
• Client specifies per operation
• Handles multi-data center operations
ALL = All replicas ack
QUORUM = > 51% of replicas ack
LOCAL_QUORUM = > 51% in local DC ack
ONE = Only one replica acks
Plus more…. (see docs)
Blog: Eventual Consistency != Hopeful Consistency
http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-
by-christos-kalantzis/
Node 1
1st copy
Node 4
Node 5
Node 2
2nd copy
Node 3
3rd copy
Parallel
Write
Write
CL=QUORUM
5 μs ack
12 μs ack
500 μs ack
12 μs ack

Cassandra - Node failure
16
• A single node failure shouldn’t bring failure.
• Replication Factor + Consistency Level = Success
• This example:
• RF = 3
• CL = QUORUM
Node 1
1st copy
Node 4
Node 5
Node 2
2nd copy
Node 3
3rd copy
Parallel
Write
Write
CL=QUORUM
5 μs ack
12 μs ack
12 μs ack
>51% ack – so request is a success

Cassandra - Node Recovery
17
• When a write is performed and a replica node for the row is unavailable the
coordinator will store a hint locally (3 hours)
• When the node recovers, the coordinator replays the missed writes.
• Note: a hinted write does not count the consistency level
• Note: you should still run repairs across your cluster
Node 1
1st copy
Node 4
Node 5
Node 2
2nd copy
Node 3
3rd copy
Stores Hints while Node 3 is offline

Cassandra Rack/Zone Failure
Rack 1
18
• Cassandra will place the data in as many
different racks or availability zones as it can.
• This example:
• RF = 3
• CL = QUORUM
• AZ/Rack 2 fails
• Data copies still available in Node 1 and
Node 5
• Quorum can be honored i.e. > 51% ack
request is a success
Node 1
1st copy
Rack 2 Rack 2
Node 4
Node 2
Node 3
2nd copy
Rack 3
Rack 1
Node 5
3rd copy

Operational Simplicity
19
• Cassandra is a complete product – there is not a multitude
of components to install, set-up and monitor.
• Extremely simple to administer and deploy
• Backups are instantaneous and simple to restore
• Supports snapshots, incremental backups and point-in-time recovery.
• Cassandra can handle non-uniform hardware and disks.
o This enables the mixing of solid state and spinning disks in a single cluster and pinning
tables to workload-appropriate disks.
• No downtime is required in Cassandra for upgrades or
adding/removing servers from the cluster.

What ´s up with DataStax?
20

DataStax at a glance
21
330+
Employees Percent Customers
Founded in April 2010
~25 500+
Santa Clara, Austin, New York, London, Sydney

DataStax delivers value
22
Certified,
Enterprise-ready
Cassandra
Security Analytics Search Visual
Monitoring
Management
Services
In-Memory
Dev. IDE &
Drivers
Professional
Services
Support &
Training
Commercial
Confidence
Enterprise
Functionality

Enterprise Integrations
23
• DataStax adds Enterprise Features like: Hadoop, Solr,
Spark

DataStax OpsCenter
24
• DataStax OpsCenter is a browser-based, visual management and
monitoring solution for Apache Cassandra and DataStax Enterprise
• Functionality is also exposed via HTTP APIs

OpsCenter - New Cluster Example
25
A new, 10-node Cassandra (or A new, 10-node DSE cluste Hr awdiothop O) cplsuCsteern wteitrh r OunpnsCinegn toenr rAunWnSin gin in 3 3 m miinnuutteess……
1 2 3 Done

OpsCenter 5.0
26
• Manage multiple clusters and nodes
• Add and remove nodes
• Administer individual nodes or in bulk
• Configure clusters
• Perform rolling restarts
• Automatically repair data
• Rebalance data
• Backup management
• Capacity planning

DataStax Office Demo
27
• 32 Raspberry Pi´s
• 16 per DataStax Enterprise 4.5 Cluster
• Managed in OpsCenter 5.0
• “Red Button” downs one DataCenter
• Not the Performance-Demo but
• Availability
• Commodity Hardware

Native Drivers
28
• Different Native Drivers available: Java, Python etc.
• Load Balancing Policies (Client Driver receives Updates)
• Data Centre Aware
• Latency Aware
• Token Aware
• Reconnection policies
• Retry policies
• Downgrading Consistency
• Plus others..
• http://www.datastax.com/download/clientdrivers

CQL
29
• Cassandra Query Language
• CQL is intended to provide a common, simpler and easier to use
interface into Cassandra - and you probably already know it!
• e.g. SELECT * FROM users
• Usual statements:
• CREATE / DROP / ALTER TABLE / SELECT

CQL Basics
CREATE KEYSPACE league WITH REPLICATION = {‘class’:’NetworkTopologyStrategy’, ‘DataCentre1’:3,
‘DataCentre2’: 2};
30
USE league;
CREATE TABLE teams (
team_name varchar,
player_name varchar,
jersey int,
PRIMARY KEY (team_name, player_name)
);
SELECT * FROM teams WHERE team_name = ‘Mighty Mutts’ and player_name = ‘Lucky’;
INSERT INTO teams (team_name, player_name, jersey) VALUES ('Mighty Mutts',’Felix’,90);

DevCenter 1.1
32
• Visual Query Tool for Developers and Administrators
• Easily create and run Cassandra Queries
• Visually navigate database objects
• Context-based suggestions

Do you remember the facts?
33

Key Attributes of a cloud database
34
1. Transparent Elasticity
2. Transparent Scalability
3. High Availability
4. Easy Data Distribution
5. Data Redundancy
6. Support for multiple data types
7. Easy to manage
8. Lower Cost
9. Multiple Infrastructure Support
10. Security

Transparent Elasticity
• Nodes can be added and removed from Cassandra online, with no
35
downtime being experienced.
1
2
3
4
6
5
1
7
10
4
2
3
5
6
8
9
11
12

Transparent Scalability
36
• Addition of Cassandra nodes increases performance linearly and
ability to manage TB’s-PB’s of data.
1
2
3
4
6
5
1
7
10
4
2
3
5
6
8
9
11
12
Performance
throughput = N
Performance
throughput = N x 2

High Availability
37
• Cassandra, with its peer-to-peer masterless architecture has no
single point of failure.
• 100% uptime is the norm!

Easy Data Distribution
38
• Cassandra allows a single logical database to span 1-N Data
Centers that are geographically dispersed.
• Also supports a hybrid Cloud implementation.

Data Redundancy
39
• Cassandra allows for customisable data redundancy so that data
is completely protected.
• Also supports rack/zone awareness (data can be replicated
between different racks to guard against machine/rack failures).

Support for multiple data types
40
• Cassandra’s data model (based on Google’s Bigtable) allows a
user to store structured, semi-structured, and unstructured data
with ease.
Portfolio Keyspace
Customer Table
ID Name SSN DOB

Easy to manage
41
• OpsCenter, AMI installers etc.. allow you to install and configure
an entire multi-node Cloud implementation in minutes.
• All can be managed and monitored via Web-based console or via
RESTful APIs.

Lower Cost
42
• Cassandra is open source software and is freely available.
• Commercial/advanced versions of Cassandra are available from
DataStax along with support and additional services.

Multiple Infrastructure Support
43
• Cassandra is supported on popular Cloud provider platforms and
operating systems.

Security
44
• Cassandra has multiple security features - authentication,
authorisation, inflight encryption
• Advanced “enterprise level” security can also provided via
DataStax Enterprise

What is Spark?
46
• Apache Project since 2010
• 10-100x faster than Hadoop MapReduce
• In-Memory Storage
• Single JVM Processor per node
• Rich Scala, Java and Python API´s
• 2x-5x less code
• Interactive Shell

Why Spark on Cassandra?
47
• Data model independent queries
• cross-table operations (JOIN, UNION, etc.)!
• complex analytics (e.g. machine learning)
• data transformation, aggregation etc.
• stream processing (coming soon)
• all nodes are Spark workers
• by default resilient to worker failures
• first node promoted as Spark Master
• Standby Master promoted on failure
• Master HA available in Dactastax Enterprise

How to Spark on Cassandra?
48
• DataStax Cassandra Spark Driver on GitHub
• Compatible with:
• Spark 0.9+
• Cassandra 2.0+
• DataStax Enterprise 4.5+
• Cassandra Tables exposes as Spark RDDs (Resilient Distributed
Datasets
• Read from and write to Cassandra
• Mapping of Cassandra tables and rows to Scala objects
• Scala only driver for now

2.1 Release - User Defined Types
50
CREATE TYPE address (
street text,
city text,
zip_code int,
phones set<text>
)
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
addresses map<text, address>
)
SELECT id, name, addresses.city, addresses.phones FROM users;
id | name | addresses.city | addresses.phones
--------------------+----------------+--------------------------
63bf691f | chris | Berlin | {’0201234567', ’0796622222'}

2.1 Release - Secondary Indexes on
collections
51
CREATE TABLE songs (
id uuid PRIMARY KEY,
artist text,
album text,
title text,
data blob,
tags set<text>
);
CREATE INDEX song_tags_idx ON songs(tags);
SELECT * FROM songs WHERE tags CONTAINS 'blues';
id | album | artist | tags | title
----------+---------------+-------------------+-----------------------+------------------
5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind

Thanks! Let´s see a demo!
52

BigData Developers MeetUp

More Related Content

What's hot

Viewers also liked

Similar to BigData Developers MeetUp

Recently uploaded

BigData Developers MeetUp