DataStax C*ollege Credit: What and Why NoSQL?

Aaron Morton
Robin Schumacher
1

• 40 minute webinar
• 15 minute Q+A
• #CassandraQA
• WebEx Q&A window
• Slides and recording will be available
• Next webcast:
• Time for a new relationship? (Information Week)
• September 26th

2

Aaron Morton (@aaronmorton)
DataStax MVP for Apache Cassandra
Aaron Morton is a Freelance Developer based in New Zealand, and a
Committer on the Apache Cassandra project. In 2010 he gave up the RDBMS
world for the scale and reliability of Cassandra. He now spends his time
advancing the Cassandra project and helping others get the best out of it.
www.thelastpickle.com

3

Robin Schumacher
VP of Products @ DataStax
Robin Schumacher has spent the last 20 years working with databases and big data.
Before DataStax he was at EnterpriseDB, where he built and led a market-driven
product management group. Previously, Robin started and led the product
management team at MySQL for three years before they were bought by Sun, and
then by Oracle. He also started and led the product management team at
Embarcadero Technologies.
Robin is the author of three database performance books and a frequent speaker at
industry events. Robin holds BS, MA, and Ph.D. degrees from various universities.

4

1986 First ANSI standard.
1989 FOREIGN KEY
1992 New types, JOIN, DDL, Transaction Isolation Levels
1999 Triggers
9

1996, v3.19 First public release
1999, v3.23 MyISAM engine, no Transactions
2001, v4.X InnoDB, ACID Transactions, FOREIGN KEY
10

1995, v6.0 PRIMARY KEY, FOREIGN KEY
1996, v6.5 JOIN
1998, v7.0 NVARCHAR, replication
2000, v2000 Referential Integrity actions
11

1989, v1.0 Small limited release
1997, v6.2 Triggers
1998, v6.3 Sub selects
1999, v6.5.3 MVCC Transactions
2000, v7.0.3 FOREIGN KEY, JOIN
12

• Adds application complexity
• Adds operational complexity
• Thundering Herds
• “There are 2 hard problems in computer science:
caching, naming, and off-by-1 errors”

13

• Adds operational complexity
• Schema defined in multiple databases
• SPOF for shard
• Hard to grow and keep balanced
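
To illustrate why a sharded setup is hard to grow and keep balanced, here is a minimal sketch. It assumes the application routes keys with a simple hash-modulo scheme, which the slide does not specify; the names `shard_for` and `fraction_moved` are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash and modulo (assumed routing scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


# Hypothetical key set, used only for the illustration.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_count=4, new_count=5)
print(f"{moved:.0%} of keys change shard when going from 4 to 5 shards")
```

Under this assumed scheme, adding one shard to four relocates roughly four out of five keys, so growing the cluster means a large data migration.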

14

• Fail over may add application complexity
• Unknown asynchronous delay in
replication
• Potentially wasting resources on Slave
• Reliability of passive Slave is unknown
• “We failed to fail over to the slave.”

15

• Unknown asynchronous delay in replication
• SPOF for writes

16

• ALTER TABLE locks the table
• Must be applied to many individual servers
• “foo varchar(50) DEFAULT NULL”

17

2007 Tokyo Cabinet
2009 Redis
2009 Voldemort
2009 Riak

22

2008 Apache CouchDB
2009 MongoDB

23

2007 Neo4j
2009 InfoGrid
2010 InfiniteGraph

24

2007 Apache HBase (as part of Lucene)
2008 / 2011 BigTable as part of Google App Engine

2009 Apache Cassandra

2012 Amazon DynamoDB

25

• Cluster based
• Replication built in
• No schema or flexible schema
• Expect node failure

26

• Aaron Morton
• @aaronmorton
• www.thelastpickle.com

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
27

“NoSQL is the stuff of the Internet
Age.”
- Andrew Oliver,
InfoWorld

29

What Characterizes the “Internet Age” with data?

1. Big Data – Concerns…
• Scaling data velocity, variety, volume
2. Data in the Cloud – Promises…
• Transparent elasticity
• Scalability
• Availability
• Ease of use (data distribution, redundancy, etc.)
• All these also needed on premise…
3. Data “everywhere” – needing to support multiple
data centers, geographies, etc.

30

Why NoSQL?
You have Big Data use cases.
• Volume, variety, velocity
• Complexity of data distribution
• Future proof apps where scaling is concerned

“Big data technologies describe a new
generation of technologies and
architectures, designed to economically
extract value from very large volumes of
a wide variety of data, by enabling high-
velocity capture, discovery, and/or
analysis” - IDC

31

Why NoSQL?
Cassandra – a massively scalable NoSQL database
• Superior write performance for data velocity
• Strong data type support for data variety
• Linear scalability/scale out for data volume
• Fast for both reads and writes

“We’ve seen a 700% performance
improvement, while our database grew over
500% at the same time. Plus we’ve saved
40% in operational costs.” - SourceNinja

32

Why NoSQL? Cassandra and Performance
“In terms of scalability, there is a clear winner
throughout our experiments. Cassandra
achieves the highest throughput for the
maximum number of nodes in all experiments
with a linear increasing throughput.”
Solving Big Data Challenges for Enterprise Application Performance Management, Tilmann Rabl,
et al., August 2012, p. 10. Benchmark paper presented at the Very Large Data Bases (VLDB)
conference, 2012. http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf

In the Cloud…
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

In Web Apps…
YCSB Benchmark
Source: http://blog.cubrid.org/dev-platform/nosql-benchmarking/?utm_source=NoSQL+Weekly+List&utm_campaign=143fae86b2-NoSQL_Weekly_Issue_41_September_8_2011&utm_medium=email
33

Why NoSQL?
You need continuous availability.
• Different than high availability
• For applications that can’t go down
• May involve one or multiple locations

34

Why NoSQL?
Cassandra – a continuously available NoSQL DBMS
• Built to overcome the fact that hardware failures can and do
occur
• No single point of failure
• Out-of-the-box redundancy of function and data

“For us, the primary motivating factors are continuous
availability and multi-data center support. We also like
the fact that we can trust Cassandra; when we need to
write data, we don’t have to worry that it’s going to get
written and be there no matter what.” - RightScale

35

Why NoSQL?
You need true location independence.
• Need to read AND write data anywhere
• Data is eventually synchronized in all locations
• Keep data local for fast access

36

Why NoSQL?
Cassandra – a location independent database
• Replication is multi-data center, multi-directional capable
• Handles multiple cloud geo-zones
• Supports hybrid on-premise/cloud deployments
• Tunable data consistency

“I can create a Cassandra cluster in any region
of the world in 10 minutes. When marketing
decide we want to move into a certain part of
the world, we’re ready.” - Netflix

37

Why NoSQL?
You need real-time, transactional capabilities
• For applications needing ACID, use RDBMS
• For applications without ACID requirements, but with
transactional needs, use NoSQL
• The “C” in ACID does not apply to NoSQL; the “C” in the CAP
theorem does

“Ninety-five percent (95%) of database-driven
systems today don’t need ACID transactions.”
– Dan McCreary, The CIO’s Guide to NoSQL
Webinar

38

Why NoSQL?
Cassandra – real-time NoSQL transactions
• Supports AID transactions: atomic, isolated, and durable
• Provides tunable data consistency – per operation – to
handle the “C” in the CAP theorem
• No ACID “C” as there are no referential integrity/foreign key
constraints
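
As a rough sketch of what per-operation tunable consistency means: in replicated stores of this kind, a read is expected to see the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The helper names below are hypothetical, and the check is a simplification, not Cassandra's implementation.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas for the given replication factor."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when the read and write replica sets must overlap (W + R > N)."""
    return write_replicas + read_replicas > replication_factor


rf = 3  # assumed replication factor for the example
levels = [
    ("write ONE, read ONE", 1, 1),
    ("write QUORUM, read QUORUM", quorum(rf), quorum(rf)),
    ("write ALL, read ONE", rf, 1),
]
for name, w, r in levels:
    overlap = reads_see_latest_write(rf, w, r)
    print(f"{name}: overlapping replica guaranteed = {overlap}")
```

Choosing ONE for both operations favors latency and availability; choosing QUORUM for both trades some of that for reads that overlap the most recent write.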

“Cassandra stands at the front of the NoSQL
pack when it comes to supporting real-time,
Big Data applications.” – Wikibon

39

Why NoSQL?
You need a more flexible/agile data model.
• Escape the rigidity of the relational data model
• Able to easily store and access all data types
• Few worries about performance of “wide” rows

40

Why NoSQL?
The Cassandra Data Model - Bigtable
• A row-oriented, column structure
• A column family is similar to an RDBMS table but is
more flexible/dynamic
• A row in a column family is indexed by its key.
Other columns may be indexed as well
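
One way to picture the bullets above is a map of row keys to per-row column maps, plus optional indexes on other columns. The sketch below is a mental model only, with a hypothetical `ColumnFamily` class and made-up sample data; it does not reflect how Cassandra stores data on disk.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy model: rows indexed by key, each row holding its own set of columns."""

    def __init__(self, name: str):
        self.name = name
        self.rows = {}      # row key -> {column name: value}
        self.indexes = {}   # column name -> {value: set of row keys}

    def create_index(self, column: str) -> None:
        """Build a secondary index over an existing or future column."""
        index = defaultdict(set)
        for key, columns in self.rows.items():
            if column in columns:
                index[columns[column]].add(key)
        self.indexes[column] = index

    def insert(self, row_key: str, columns: dict) -> None:
        """Add or overwrite columns on a row; rows need not share columns."""
        row = self.rows.setdefault(row_key, {})
        for column, value in columns.items():
            if column in self.indexes and column in row:
                self.indexes[column][row[column]].discard(row_key)
            row[column] = value
            if column in self.indexes:
                self.indexes[column][value].add(row_key)

    def get(self, row_key: str) -> dict:
        """Look up a row by its key."""
        return self.rows.get(row_key, {})

    def find_by(self, column: str, value) -> list:
        """Look up row keys through a secondary index."""
        return sorted(self.indexes[column].get(value, set()))


users = ColumnFamily("users")
users.create_index("Name")
users.insert("1", {"Name": "Ada", "DOB": "1815-12-10"})
users.insert("2", {"Name": "Ada", "Nickname": "al"})  # different columns, same family
print(users.get("1"))
print(users.find_by("Name", "Ada"))
```

The second row carries a column the first row lacks, which is the flexibility the slide contrasts with a fixed RDBMS table definition.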

“Cassandra’s NoSQL data model allows us
to insert and query data much more
naturally than what we had previously. The
analysts who routinely use this data were
impressed with the flexibility and speed at
which the queries came back.” - NASA

[Diagram: Keyspace > Column Family > columns ID, Name, SSN, DOB]

41

Why NoSQL?
You need a better architecture.
• Master/slave – inherent issues; write bottlenecks
• Sharding – difficult to set up/maintain
• Shared storage – has availability concerns

42

Why NoSQL?
Cassandra – a “masterless” architecture
• Peer-to-peer design
• No write bottlenecks
• No manual sharding or shared storage issues
• Less operational overhead
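
A peer-to-peer design with no manual sharding typically relies on consistent hashing, where every node owns ranges of a token ring. The following is a simplified sketch, assuming MD5 tokens and 64 virtual nodes per node; the `Ring` class and node names are hypothetical, and this is not Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Position of a value on the ring (assumed MD5-based token)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring: each node owns several token ranges."""

    def __init__(self, nodes, vnodes: int = 64):
        self.vnodes = vnodes
        self.tokens = []  # sorted list of (token, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place the node's virtual tokens on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self.tokens, (token(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        """First node clockwise from the key's token, wrapping around."""
        idx = bisect.bisect_right(self.tokens, (token(key), ""))
        return self.tokens[idx % len(self.tokens)][1]


keys = [f"user:{i}" for i in range(10_000)]
ring = Ring(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-e")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved")
print("all moved keys now belong to node-e:",
      all(after[k] == "node-e" for k in moved))
```

In this model any node can compute the owner of a key, so there is no master to route through, and adding a node only takes over the ranges that fall to the newcomer.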

“Cassandra was just a better design all around
– more truly horizontally scalable and with less
management overhead – and there’s no single
point of failure. I looked at Cassandra’s
architecture and thought, ‘Yeah, that’s how you
do it.’” - Backupify

43

Why NoSQL?
Because you need…
• The ability to handle big data use cases
• Continuous availability vs. high availability
• A location independent database
• A real-time, transactional database
• A more flexible/agile data model
• A better architecture

44

Key Cassandra Use Cases
• Real-time, big data workloads
• Time series data management
• High-velocity device data consumption and analysis
• Media streaming management (e.g., music, movies)
• Social media (i.e., unstructured data) input and analysis
• Online web retail (e.g., shopping carts, user transactions)
• Real-time data analytics
• Online gaming (e.g., real-time messaging)
• Software as a Service (SaaS) applications that utilize web
services
• Online portals (e.g. healthcare provider/patient interactions)
• Most write-intensive systems

45

Why NoSQL?

- The CIO’s Guide to NoSQL, Dan McCreary

46

• Cassandra.Apache.org
• PlanetCassandra.org
• Datastax.com

47
