Evaluating Apache Cassandra as a Cloud Database 
Christian Johannsen, Solutions Engineer
About me 
2 
Christian Johannsen 
Solutions Engineer @ DataStax 
Twitter: @cjohannsen81
What is a cloud database not? 
3 
• A Cloud database is not simply taking a traditional RDBMS and 
running it in a Cloud provider’s environment
What is a cloud database? 
4 
• A cloud database is a database optimised to handle the 
infrastructure limitations and features of running on 
shared, geographically dispersed infrastructure. 
• Some of them are: Outages, noisy neighbours, security, 
variable networks etc…
Key attributes of a cloud database 
5 
• Transparent Elasticity 
• being able to add and subtract nodes (physical or virtual machines) with 
load balancing when the underlying application and business demands it. 
• Transparent Scalability 
• addition of nodes increases both (1) performance throughput; (2) ability to 
handle Big Data and maintain high performance 
• High Availability 
• always up; no single point of failure 
• Easy Data Distribution 
• able to span multiple geographies, data centers, and cloud provider 
zones. Can read/write to any node
6 
Key attributes of a cloud database 
• Support for multiple data types 
• Able to handle structured, semi-structured, and unstructured data. 
• Easy to manage 
• easy to administer a logical database across many nodes 
• Lower Cost 
• It shouldn’t break the bank and you shouldn’t have to have a massive 
upfront cost 
• Multiple Infrastructure Support 
• It should support being deployed on multiple cloud providers (public and 
private) as well as traditional infrastructure 
• Security 
• Data should be secure in flight, at rest and audited.
Key Attributes at a glance 
7 
1. Transparent Elasticity 
2. Transparent Scalability 
3. High Availability 
4. Easy Data Distribution 
5. Data Redundancy 
6. Support for multiple data types 
7. Easy to manage 
8. Lower Cost 
9. Multiple Infrastructure Support 
10. Security
Apache Cassandra 
8
Apache Cassandra 
9 
• Cassandra is a massively scalable, open source, NoSQL, distributed 
database built for modern, mission-critical online applications. 
• Written in Java and is a hybrid of Amazon Dynamo and Google BigTable 
• Masterless with no single point of failure 
• Distributed, data centre and rack aware 
• 100% uptime 
• Predictable scaling Dynamo 
BigTable 
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf 
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Apache Cassandra Overview 
• Cassandra was designed with the understanding that system/hardware failures 
10 
can and do occur 
• Peer-to-peer, distributed system 
• All nodes the same - masterless with no single point of failure 
• Data partitioned among all nodes in the cluster 
• Custom data replication to ensure fault tolerance 
• Tunable data consistency 
• Data Center and Rack aware 
• Familiar SQL-Like language – CQL 
• More capacity? Add a server! 
• More throughput? Add a server!
Cassandra Adoption 
11
Cassandra - Locally Distributed 
12 
• Replication factor (RF): How many copies of your 
data? 
• RF = 3 in this example 
• Each node is storing 60% of the clusters total data 
i.e. 3/5 
Handy Calculator: http://www.ecyrd.com/cassandracalculator/ 
Node 1 
1st copy 
Node 4 
Node 5 
Node 2 
2nd copy 
Node 3 
3rd copy 
• Client reads or writes to any node 
• Node coordinates with others 
• Data read or replicated in parallel 
Node 2 
2nd copy
Cassandra - Rack/Zone aware 
Rack 1 
13 
Node 1 
1st copy 
Rack 2 Rack 2 
Node 4 
Node 2 
Node 3 
2nd copy 
Rack 3 
Rack 1 
Node 5 
3rd copy 
• Cassandra is aware of which rack or 
zone each nod resides in 
• It will attempt to place each data copy 
in a different rack 
• RF=3 in this example
Cassandra - DC/Region aware 
14 
• Active Everywhere – reads/writes in multiple data centres 
• Client writes local 
• Data syncs across WAN 
• Replication Factor per DC 
• Different number of nodes per 
data center 
DC: USA DC: EUROPE 
Node 1 
1st copy 
Node 4 
Node 5 
Node 2 
2nd copy 
Node 3 
3rd copy 
Node 1 
1st copy 
Node 4 
Node 5 
Node 2 
2nd copy 
Node 3 
3rd copy
Cassandra - Tuneable Consistency 
15 
• Consistency Level (CL) 
• Client specifies per operation 
• Handles multi-data center operations 
ALL = All replicas ack 
QUORUM = > 51% of replicas ack 
LOCAL_QUORUM = > 51% in local DC ack 
ONE = Only one replica acks 
Plus more…. (see docs) 
Blog: Eventual Consistency != Hopeful Consistency 
http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency- 
by-christos-kalantzis/ 
Node 1 
1st copy 
Node 4 
Node 5 
Node 2 
2nd copy 
Node 3 
3rd copy 
Parallel 
Write 
Write 
CL=QUORUM 
5 μs ack 
12 μs ack 
500 μs ack 
12 μs ack
Cassandra - Node failure 
16 
• A single node failure shouldn’t bring failure. 
• Replication Factor + Consistency Level = Success 
• This example: 
• RF = 3 
• CL = QUORUM 
Node 1 
1st copy 
Node 4 
Node 5 
Node 2 
2nd copy 
Node 3 
3rd copy 
Parallel 
Write 
Write 
CL=QUORUM 
5 μs ack 
12 μs ack 
12 μs ack 
>51% ack – so request is a success
Cassandra - Node Recovery 
17 
• When a write is performed and a replica node for the row is unavailable the 
coordinator will store a hint locally (3 hours) 
• When the node recovers, the coordinator replays the missed writes. 
• Note: a hinted write does not count the consistency level 
• Note: you should still run repairs across your cluster 
Node 1 
1st copy 
Node 4 
Node 5 
Node 2 
2nd copy 
Node 3 
3rd copy 
Stores Hints while Node 3 is offline
Cassandra Rack/Zone Failure 
Rack 1 
18 
• Cassandra will place the data in as many 
different racks or availability zones as it can. 
• This example: 
• RF = 3 
• CL = QUORUM 
• AZ/Rack 2 fails 
• Data copies still available in Node 1 and 
Node 5 
• Quorum can be honored i.e. > 51% ack 
request is a success 
Node 1 
1st copy 
Rack 2 Rack 2 
Node 4 
Node 2 
Node 3 
2nd copy 
Rack 3 
Rack 1 
Node 5 
3rd copy
Operational Simplicity 
19 
• Cassandra is a complete product – there is not a multitude 
of components to install, set-up and monitor. 
• Extremely simple to administer and deploy 
• Backups are instantaneous and simple to restore 
• Supports snapshots, incremental backups and point-in-time recovery. 
• Cassandra can handle non-uniform hardware and disks. 
o This enables the mixing of solid state and spinning disks in a single cluster and pinning 
tables to workload-appropriate disks. 
• No downtime is required in Cassandra for upgrades or 
adding/removing servers from the cluster.
What ´s up with DataStax? 
20
DataStax at a glance 
21 
330+ 
Employees Percent Customers 
Founded in April 2010 
~25 500+ 
Santa Clara, Austin, New York, London, Sydney
DataStax delivers value 
22 
Certified, 
Enterprise-ready 
Cassandra 
Security Analytics Search Visual 
Monitoring 
Management 
Services 
In-Memory 
Dev. IDE & 
Drivers 
Professional 
Services 
Support & 
Training 
Commercial 
Confidence 
Enterprise 
Functionality
Enterprise Integrations 
23 
• DataStax adds Enterprise Features like: Hadoop, Solr, 
Spark
DataStax OpsCenter 
24 
• DataStax OpsCenter is a browser-based, visual management and 
monitoring solution for Apache Cassandra and DataStax Enterprise 
• Functionality is also exposed via HTTP APIs
OpsCenter - New Cluster Example 
25 
A new, 10-node Cassandra (or A new, 10-node DSE cluste Hr awdiothop O) cplsuCsteern wteitrh r OunpnsCinegn toenr rAunWnSin gin in 3 3 m miinnuutteess…… 
1 2 3 Done
OpsCenter 5.0 
26 
• Manage multiple clusters and nodes 
• Add and remove nodes 
• Administer individual nodes or in bulk 
• Configure clusters 
• Perform rolling restarts 
• Automatically repair data 
• Rebalance data 
• Backup management 
• Capacity planning
DataStax Office Demo 
27 
• 32 Raspberry Pi´s 
• 16 per DataStax Enterprise 4.5 Cluster 
• Managed in OpsCenter 5.0 
• “Red Button” downs one DataCenter 
• Not the Performance-Demo but 
• Availability 
• Commodity Hardware
Native Drivers 
28 
• Different Native Drivers available: Java, Python etc. 
• Load Balancing Policies (Client Driver receives Updates) 
• Data Centre Aware 
• Latency Aware 
• Token Aware 
• Reconnection policies 
• Retry policies 
• Downgrading Consistency 
• Plus others.. 
• http://www.datastax.com/download/clientdrivers
CQL 
29 
• Cassandra Query Language 
• CQL is intended to provide a common, simpler and easier to use 
interface into Cassandra - and you probably already know it! 
• e.g. SELECT * FROM users 
• Usual statements: 
• CREATE / DROP / ALTER TABLE / SELECT
CQL Basics 
CREATE KEYSPACE league WITH REPLICATION = {‘class’:’NetworkTopologyStrategy’, ‘DataCentre1’:3, 
‘DataCentre2’: 2}; 
30 
USE league; 
CREATE TABLE teams ( 
team_name varchar, 
player_name varchar, 
jersey int, 
PRIMARY KEY (team_name, player_name) 
); 
SELECT * FROM teams WHERE team_name = ‘Mighty Mutts’ and player_name = ‘Lucky’; 
INSERT INTO teams (team_name, player_name, jersey) VALUES ('Mighty Mutts',’Felix’,90);
CQL Data Types 
31
DevCenter 1.1 
32 
• Visual Query Tool for Developers and Administrators 
• Easily create and run Cassandra Queries 
• Visually navigate database objects 
• Context-based suggestions
Do you remember the facts? 
33
Key Attributes of a cloud database 
34 
1. Transparent Elasticity 
2. Transparent Scalability 
3. High Availability 
4. Easy Data Distribution 
5. Data Redundancy 
6. Support for multiple data types 
7. Easy to manage 
8. Lower Cost 
9. Multiple Infrastructure Support 
10. Security
Transparent Elasticity 
• Nodes can be added and removed from Cassandra online, with no 
35 
downtime being experienced. 
1 
2 
3 
4 
6 
5 
1 
7 
10 
4 
2 
3 
5 
6 
8 
9 
11 
12
Transparent Scalability 
36 
• Addition of Cassandra nodes increases performance linearly and 
ability to manage TB’s-PB’s of data. 
1 
2 
3 
4 
6 
5 
1 
7 
10 
4 
2 
3 
5 
6 
8 
9 
11 
12 
Performance 
throughput = N 
Performance 
throughput = N x 2
High Availability 
37 
• Cassandra, with its peer-to-peer masterless architecture has no 
single point of failure. 
• 100% uptime is the norm!
Easy Data Distribution 
38 
• Cassandra allows a single logical database to span 1-N Data 
Centers that are geographically dispersed. 
• Also supports a hybrid Cloud implementation.
Data Redundancy 
39 
• Cassandra allows for customisable data redundancy so that data 
is completely protected. 
• Also supports rack/zone awareness (data can be replicated 
between different racks to guard against machine/rack failures).
Support for multiple data types 
40 
• Cassandra’s data model (based on Google’s Bigtable) allows a 
user to store structured, semi-structured, and unstructured data 
with ease. 
Portfolio Keyspace 
Customer Table 
ID Name SSN DOB
Easy to manage 
41 
• OpsCenter, AMI installers etc.. allow you to install and configure 
an entire multi-node Cloud implementation in minutes. 
• All can be managed and monitored via Web-based console or via 
RESTful APIs.
Lower Cost 
42 
• Cassandra is open source software and is freely available. 
• Commercial/advanced versions of Cassandra are available from 
DataStax along with support and additional services.
Multiple Infrastructure Support 
43 
• Cassandra is supported on popular Cloud provider platforms and 
operating systems.
Security 
44 
• Cassandra has multiple security features - authentication, 
authorisation, inflight encryption 
• Advanced “enterprise level” security can also provided via 
DataStax Enterprise
What´s new?! 
45
What is Spark? 
46 
• Apache Project since 2010 
• 10-100x faster than Hadoop MapReduce 
• In-Memory Storage 
• Single JVM Processor per node 
• Rich Scala, Java and Python API´s 
• 2x-5x less code 
• Interactive Shell
Why Spark on Cassandra? 
47 
• Data model independent queries 
• cross-table operations (JOIN, UNION, etc.)! 
• complex analytics (e.g. machine learning) 
• data transformation, aggregation etc. 
• stream processing (coming soon) 
• all nodes are Spark workers 
• by default resilient to worker failures 
• first node promoted as Spark Master 
• Standby Master promoted on failure 
• Master HA available in Dactastax Enterprise
How to Spark on Cassandra? 
48 
• DataStax Cassandra Spark Driver on GitHub 
• Compatible with: 
• Spark 0.9+ 
• Cassandra 2.0+ 
• DataStax Enterprise 4.5+ 
• Cassandra Tables exposes as Spark RDDs (Resilient Distributed 
Datasets 
• Read from and write to Cassandra 
• Mapping of Cassandra tables and rows to Scala objects 
• Scala only driver for now
Spark type mapping 
49
2.1 Release - User Defined Types 
50 
CREATE TYPE address ( 
street text, 
city text, 
zip_code int, 
phones set<text> 
) 
CREATE TABLE users ( 
id uuid PRIMARY KEY, 
name text, 
addresses map<text, address> 
) 
SELECT id, name, addresses.city, addresses.phones FROM users; 
id | name | addresses.city | addresses.phones 
--------------------+----------------+-------------------------- 
63bf691f | chris | Berlin | {’0201234567', ’0796622222'}
2.1 Release - Secondary Indexes on 
collections 
51 
CREATE TABLE songs ( 
id uuid PRIMARY KEY, 
artist text, 
album text, 
title text, 
data blob, 
tags set<text> 
); 
CREATE INDEX song_tags_idx ON songs(tags); 
SELECT * FROM songs WHERE tags CONTAINS 'blues'; 
id | album | artist | tags | title 
----------+---------------+-------------------+-----------------------+------------------ 
5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind
Thanks! Let´s see a demo! 
52

BigData Developers MeetUp

  • 1.
    Evaluating Apache Cassandraas a Cloud Database Christian Johannsen, Solutions Engineer
  • 2.
    About me 2 Christian Johannsen Solutions Engineer @ DataStax Twitter: @cjohannsen81
  • 3.
    What is acloud database not? 3 • A Cloud database is not simply taking a traditional RDBMS and running it in a Cloud provider’s environment
  • 4.
    What is acloud database? 4 • A cloud database is a database optimised to handle the infrastructure limitations and features of running on shared, geographically dispersed infrastructure. • Some of them are: Outages, noisy neighbours, security, variable networks etc…
  • 5.
    Key attributes ofa cloud database 5 • Transparent Elasticity • being able to add and subtract nodes (physical or virtual machines) with load balancing when the underlying application and business demands it. • Transparent Scalability • addition of nodes increases both (1) performance throughput; (2) ability to handle Big Data and maintain high performance • High Availability • always up; no single point of failure • Easy Data Distribution • able to span multiple geographies, data centers, and cloud provider zones. Can read/write to any node
  • 6.
    6 Key attributesof a cloud database • Support for multiple data types • Able to handle structured, semi-structured, and unstructured data. • Easy to manage • easy to administer a logical database across many nodes • Lower Cost • It shouldn’t break the bank and you shouldn’t have to have a massive upfront cost • Multiple Infrastructure Support • It should support being deployed on multiple cloud providers (public and private) as well as traditional infrastructure • Security • Data should be secure in flight, at rest and audited.
  • 7.
    Key Attributes ata glance 7 1. Transparent Elasticity 2. Transparent Scalability 3. High Availability 4. Easy Data Distribution 5. Data Redundancy 6. Support for multiple data types 7. Easy to manage 8. Lower Cost 9. Multiple Infrastructure Support 10. Security
  • 8.
  • 9.
    Apache Cassandra 9 • Cassandra is a massively scalable, open source, NoSQL, distributed database built for modern, mission-critical online applications. • Written in Java and is a hybrid of Amazon Dynamo and Google BigTable • Masterless with no single point of failure • Distributed, data centre and rack aware • 100% uptime • Predictable scaling Dynamo BigTable BigTable: http://research.google.com/archive/bigtable-osdi06.pdf Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  • 10.
    Apache Cassandra Overview • Cassandra was designed with the understanding that system/hardware failures 10 can and do occur • Peer-to-peer, distributed system • All nodes the same - masterless with no single point of failure • Data partitioned among all nodes in the cluster • Custom data replication to ensure fault tolerance • Tunable data consistency • Data Center and Rack aware • Familiar SQL-Like language – CQL • More capacity? Add a server! • More throughput? Add a server!
  • 11.
  • 12.
    Cassandra - LocallyDistributed 12 • Replication factor (RF): How many copies of your data? • RF = 3 in this example • Each node is storing 60% of the clusters total data i.e. 3/5 Handy Calculator: http://www.ecyrd.com/cassandracalculator/ Node 1 1st copy Node 4 Node 5 Node 2 2nd copy Node 3 3rd copy • Client reads or writes to any node • Node coordinates with others • Data read or replicated in parallel Node 2 2nd copy
  • 13.
    Cassandra - Rack/Zoneaware Rack 1 13 Node 1 1st copy Rack 2 Rack 2 Node 4 Node 2 Node 3 2nd copy Rack 3 Rack 1 Node 5 3rd copy • Cassandra is aware of which rack or zone each nod resides in • It will attempt to place each data copy in a different rack • RF=3 in this example
  • 14.
    Cassandra - DC/Regionaware 14 • Active Everywhere – reads/writes in multiple data centres • Client writes local • Data syncs across WAN • Replication Factor per DC • Different number of nodes per data center DC: USA DC: EUROPE Node 1 1st copy Node 4 Node 5 Node 2 2nd copy Node 3 3rd copy Node 1 1st copy Node 4 Node 5 Node 2 2nd copy Node 3 3rd copy
  • 15.
    Cassandra - TuneableConsistency 15 • Consistency Level (CL) • Client specifies per operation • Handles multi-data center operations ALL = All replicas ack QUORUM = > 51% of replicas ack LOCAL_QUORUM = > 51% in local DC ack ONE = Only one replica acks Plus more…. (see docs) Blog: Eventual Consistency != Hopeful Consistency http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency- by-christos-kalantzis/ Node 1 1st copy Node 4 Node 5 Node 2 2nd copy Node 3 3rd copy Parallel Write Write CL=QUORUM 5 μs ack 12 μs ack 500 μs ack 12 μs ack
  • 16.
    Cassandra - Nodefailure 16 • A single node failure shouldn’t bring failure. • Replication Factor + Consistency Level = Success • This example: • RF = 3 • CL = QUORUM Node 1 1st copy Node 4 Node 5 Node 2 2nd copy Node 3 3rd copy Parallel Write Write CL=QUORUM 5 μs ack 12 μs ack 12 μs ack >51% ack – so request is a success
  • 17.
    Cassandra - NodeRecovery 17 • When a write is performed and a replica node for the row is unavailable the coordinator will store a hint locally (3 hours) • When the node recovers, the coordinator replays the missed writes. • Note: a hinted write does not count the consistency level • Note: you should still run repairs across your cluster Node 1 1st copy Node 4 Node 5 Node 2 2nd copy Node 3 3rd copy Stores Hints while Node 3 is offline
  • 18.
    Cassandra Rack/Zone Failure Rack 1 18 • Cassandra will place the data in as many different racks or availability zones as it can. • This example: • RF = 3 • CL = QUORUM • AZ/Rack 2 fails • Data copies still available in Node 1 and Node 5 • Quorum can be honored i.e. > 51% ack request is a success Node 1 1st copy Rack 2 Rack 2 Node 4 Node 2 Node 3 2nd copy Rack 3 Rack 1 Node 5 3rd copy
  • 19.
    Operational Simplicity 19 • Cassandra is a complete product – there is not a multitude of components to install, set-up and monitor. • Extremely simple to administer and deploy • Backups are instantaneous and simple to restore • Supports snapshots, incremental backups and point-in-time recovery. • Cassandra can handle non-uniform hardware and disks. o This enables the mixing of solid state and spinning disks in a single cluster and pinning tables to workload-appropriate disks. • No downtime is required in Cassandra for upgrades or adding/removing servers from the cluster.
  • 20.
    What ´s upwith DataStax? 20
  • 21.
    DataStax at aglance 21 330+ Employees Percent Customers Founded in April 2010 ~25 500+ Santa Clara, Austin, New York, London, Sydney
  • 22.
    DataStax delivers value 22 Certified, Enterprise-ready Cassandra Security Analytics Search Visual Monitoring Management Services In-Memory Dev. IDE & Drivers Professional Services Support & Training Commercial Confidence Enterprise Functionality
  • 23.
    Enterprise Integrations 23 • DataStax adds Enterprise Features like: Hadoop, Solr, Spark
  • 24.
    DataStax OpsCenter 24 • DataStax OpsCenter is a browser-based, visual management and monitoring solution for Apache Cassandra and DataStax Enterprise • Functionality is also exposed via HTTP APIs
  • 25.
    OpsCenter - NewCluster Example 25 A new, 10-node Cassandra (or A new, 10-node DSE cluste Hr awdiothop O) cplsuCsteern wteitrh r OunpnsCinegn toenr rAunWnSin gin in 3 3 m miinnuutteess…… 1 2 3 Done
  • 26.
    OpsCenter 5.0 26 • Manage multiple clusters and nodes • Add and remove nodes • Administer individual nodes or in bulk • Configure clusters • Perform rolling restarts • Automatically repair data • Rebalance data • Backup management • Capacity planning
  • 27.
    DataStax Office Demo 27 • 32 Raspberry Pi´s • 16 per DataStax Enterprise 4.5 Cluster • Managed in OpsCenter 5.0 • “Red Button” downs one DataCenter • Not the Performance-Demo but • Availability • Commodity Hardware
  • 28.
    Native Drivers 28 • Different Native Drivers available: Java, Python etc. • Load Balancing Policies (Client Driver receives Updates) • Data Centre Aware • Latency Aware • Token Aware • Reconnection policies • Retry policies • Downgrading Consistency • Plus others.. • http://www.datastax.com/download/clientdrivers
  • 29.
    CQL 29 •Cassandra Query Language • CQL is intended to provide a common, simpler and easier to use interface into Cassandra - and you probably already know it! • e.g. SELECT * FROM users • Usual statements: • CREATE / DROP / ALTER TABLE / SELECT
  • 30.
    CQL Basics CREATEKEYSPACE league WITH REPLICATION = {‘class’:’NetworkTopologyStrategy’, ‘DataCentre1’:3, ‘DataCentre2’: 2}; 30 USE league; CREATE TABLE teams ( team_name varchar, player_name varchar, jersey int, PRIMARY KEY (team_name, player_name) ); SELECT * FROM teams WHERE team_name = ‘Mighty Mutts’ and player_name = ‘Lucky’; INSERT INTO teams (team_name, player_name, jersey) VALUES ('Mighty Mutts',’Felix’,90);
  • 31.
  • 32.
    DevCenter 1.1 32 • Visual Query Tool for Developers and Administrators • Easily create and run Cassandra Queries • Visually navigate database objects • Context-based suggestions
  • 33.
    Do you rememberthe facts? 33
  • 34.
    Key Attributes ofa cloud database 34 1. Transparent Elasticity 2. Transparent Scalability 3. High Availability 4. Easy Data Distribution 5. Data Redundancy 6. Support for multiple data types 7. Easy to manage 8. Lower Cost 9. Multiple Infrastructure Support 10. Security
  • 35.
    Transparent Elasticity •Nodes can be added and removed from Cassandra online, with no 35 downtime being experienced. 1 2 3 4 6 5 1 7 10 4 2 3 5 6 8 9 11 12
  • 36.
    Transparent Scalability 36 • Addition of Cassandra nodes increases performance linearly and ability to manage TB’s-PB’s of data. 1 2 3 4 6 5 1 7 10 4 2 3 5 6 8 9 11 12 Performance throughput = N Performance throughput = N x 2
  • 37.
    High Availability 37 • Cassandra, with its peer-to-peer masterless architecture has no single point of failure. • 100% uptime is the norm!
  • 38.
    Easy Data Distribution 38 • Cassandra allows a single logical database to span 1-N Data Centers that are geographically dispersed. • Also supports a hybrid Cloud implementation.
  • 39.
    Data Redundancy 39 • Cassandra allows for customisable data redundancy so that data is completely protected. • Also supports rack/zone awareness (data can be replicated between different racks to guard against machine/rack failures).
  • 40.
    Support for multipledata types 40 • Cassandra’s data model (based on Google’s Bigtable) allows a user to store structured, semi-structured, and unstructured data with ease. Portfolio Keyspace Customer Table ID Name SSN DOB
  • 41.
    Easy to manage 41 • OpsCenter, AMI installers etc.. allow you to install and configure an entire multi-node Cloud implementation in minutes. • All can be managed and monitored via Web-based console or via RESTful APIs.
  • 42.
    Lower Cost 42 • Cassandra is open source software and is freely available. • Commercial/advanced versions of Cassandra are available from DataStax along with support and additional services.
  • 43.
    Multiple Infrastructure Support 43 • Cassandra is supported on popular Cloud provider platforms and operating systems.
  • 44.
    Security 44 •Cassandra has multiple security features - authentication, authorisation, inflight encryption • Advanced “enterprise level” security can also provided via DataStax Enterprise
  • 45.
  • 46.
    What is Spark? 46 • Apache Project since 2010 • 10-100x faster than Hadoop MapReduce • In-Memory Storage • Single JVM Processor per node • Rich Scala, Java and Python API´s • 2x-5x less code • Interactive Shell
  • 47.
    Why Spark onCassandra? 47 • Data model independent queries • cross-table operations (JOIN, UNION, etc.)! • complex analytics (e.g. machine learning) • data transformation, aggregation etc. • stream processing (coming soon) • all nodes are Spark workers • by default resilient to worker failures • first node promoted as Spark Master • Standby Master promoted on failure • Master HA available in Dactastax Enterprise
  • 48.
    How to Sparkon Cassandra? 48 • DataStax Cassandra Spark Driver on GitHub • Compatible with: • Spark 0.9+ • Cassandra 2.0+ • DataStax Enterprise 4.5+ • Cassandra Tables exposes as Spark RDDs (Resilient Distributed Datasets • Read from and write to Cassandra • Mapping of Cassandra tables and rows to Scala objects • Scala only driver for now
  • 49.
  • 50.
    2.1 Release -User Defined Types 50 CREATE TYPE address ( street text, city text, zip_code int, phones set<text> ) CREATE TABLE users ( id uuid PRIMARY KEY, name text, addresses map<text, address> ) SELECT id, name, addresses.city, addresses.phones FROM users; id | name | addresses.city | addresses.phones --------------------+----------------+-------------------------- 63bf691f | chris | Berlin | {’0201234567', ’0796622222'}
  • 51.
    2.1 Release -Secondary Indexes on collections 51 CREATE TABLE songs ( id uuid PRIMARY KEY, artist text, album text, title text, data blob, tags set<text> ); CREATE INDEX song_tags_idx ON songs(tags); SELECT * FROM songs WHERE tags CONTAINS 'blues'; id | album | artist | tags | title ----------+---------------+-------------------+-----------------------+------------------ 5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind
  • 52.