Developing with Cassandra
Published in: Technology
Transcript of "Developing with Cassandra"

  1. Developing with Cassandra
  2. Choose the DB
     Velocity, Volume, Variety
     • In-memory key-value: Redis, CouchBase
     • File system: BerkeleyDB
     • Distributed file system: HDFS
     • In-memory NewSQL / domain specific: VoltDB
     • Traditional [SQL] RDBMS
     • MongoDB
     • Distributed DBMS: Cassandra, HBase
     CAP theorem (pick two):
     • Consistency
     • Availability
     • Partition tolerance
     Eventual consistency
     Transactional? No
  3. Why select Cassandra?
     • [relatively] easy to set up
     • [relatively] easy to use
     • ~zero routine ops
     • it works (!!) as promised:
       o real-time replication
       o node/site failure recovery
       o zero load writes
       o double the nodes = double the speed
  4. Because Cassandra is Fast!
     But it needs some time to deliver
     • 12,000 writes per second (WPS) on a laptop
     • ~0.1 / 1 ms constant latency for writes / reads
  5. Scalability
     Check references:
     • http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
     • http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  6. Good and Not So Good
     Good for:
     • log-like data (TTL helps)
     • massive writes (1M WPS enough?)
     • simple real-time analytics
     Not so good for:
     • dump of junk (consider HDFS)
     • OLAP (depends on "O")
  7. Apache Cassandra
     • Distributed DBMS
     • Just a DBMS - a closed monolithic solution:
       o not a platform to run custom code (as MongoDB);
       o not an extension (as HBase);
       o highly optimized
     • No master, eventually consistent
     • NoSQL
     • Data model - key-value
     http://cassandra.apache.org
  8. History
     Developed at Facebook for Inbox search
     Released to open source in 2008
     In use:
     • Netflix - main non-content data store, ~500 Cassandra nodes (2012)
     • eBay - recommendation system, "dozens of nodes", 200 TB storage (2012)
     • Twitter - tweet analysis, 100+ TB of data
     • More clients: http://www.datastax.com/cassandrausers
  9. Mature & Agile
     • 1.0 - October 2011
     • 1.1 - April 2012
     • 1.2 - January 2013
     • 2.0 - expected this summer (2013)
     June 26, 2013 - 158 bugs, 89 worth noting
     Sperasoft experience:
     • hit 1 bug in production (stability issue)
     • hit 1 bug in QA (in a crafted case)
  10. Distributions
      • Apache .tar.gz and Debian packages: http://cassandra.apache.org/download/
      • DataStax DSC - Cassandra + OpsCenter: http://planetcassandra.org/Download/DataStaxCommunityEdition
      • Embedded - for functional tests on Java apps, Maven
      Documentation:
      • http://wiki.apache.org/cassandra/
      • http://www.datastax.com/docs
  11. What Hardware?
      http://www.slideshare.net/planetcassandra/5-andy-cobley-raspberry-pi
      • CPU: ARM 700 MHz
      • RAM: 500 MB
      • Storage: SD card
      • Price: $25
      200 WPS!
  12. More on Hardware
      Bare metal:
      • CPU: 8 cores (4 works too)
      • RAM: 16 - 64 GB (min 8 GB)
      • Storage: rotating disks, 3 - 5 TB total (SSD better)
      VM works too, but...
      • Storage: local disks, avoid NAS
  13. Software Yes & No (production) ?
  14. Software for Development: All Yes
      Plus more: Java, Python, C#, PHP, Ruby, Clojure, Go, R, ...
  15. Data Model
      • Keyspace ~ RDBMS DB / schema
      • Column family ~ RDBMS table
      • Diagram labels: Key1, Key2, Value; Column, Row; Clustering key, Partitioning key
      • Map < ... , Map < ... , ... > >
  16. Data on Discs
      [Diagram: Client; Node 1, Node 2, Node 3, each with a CF; parts 1, 2, 3]
      Parallel reads, writes
  17. Client API Options
      Thrift RPC            | Native protocol + CQL3
      Apache Thrift         | Custom protocol
      Synchronous           | Asynchronous
      Schema-less           | Static schema
      Store & Forward       | Cursors promised in 2.0
      API for any language  | Java; Python, C# coming
      Cryptic API           | JDBC-like API
      Still supported       | Going forward
      https://github.com/datastax/java-driver
  18. Data Modeling for NoSQL
      • Forget RDB design principles
      • Forget abstract data model - shape data for queries
      • No joins - materialized views
      • Data duplication - OK
      • Remember eventual consistency
      • Queries are precious
      • Use the right data types - timestamp, uuid
      Why? Because NoSQL is a low-level tool for high optimization.
  19. Do & Don't
      Patterns and Anti-patterns
  20. Sequential Read
      Query, Wait, Read, Query, Wait, Read: slo-o-ow
      Parallelize it.
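The sequential-versus-parallel point on slide 20 can be sketched without a cluster. This is a minimal illustration, not driver code: `fake_query` is a hypothetical stand-in that only sleeps, the 50 ms latency is invented, and a thread pool is just one way to overlap the waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(key, latency=0.05):
    """Hypothetical stand-in for one blocking query (not a driver call)."""
    time.sleep(latency)  # the "Wait" step from the slide
    return f"row-{key}"


def read_sequential(keys):
    # Query, wait, read, one key after another.
    return [fake_query(k) for k in keys]


def read_parallel(keys, workers=8):
    # Issue all queries at once; the waits overlap.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_query, keys))  # map keeps input order


def timed(fn, keys):
    start = time.perf_counter()
    rows = fn(keys)
    return rows, time.perf_counter() - start


keys = list(range(8))
seq_rows, seq_time = timed(read_sequential, keys)
par_rows, par_time = timed(read_parallel, keys)
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both functions return the same rows; only the total waiting time differs.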
  21. Timeline
      One row per event (row key: event; columns: timestamp ...):
          CREATE TABLE timeline (
            event uuid,
            timestamp timeuuid,
            ...
            PRIMARY KEY (event, timestamp)
          );
      One row per event and date (row key: event:date; columns: timestamp ...):
          CREATE TABLE timeline (
            event uuid,
            date bigint,
            timestamp timeuuid,
            ...
            PRIMARY KEY ((event, date), timestamp)
          );
      Still bad - need sharding
      http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
      Long rows - Cassandra handles 2G columns, but... slo-o-ow
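A small sketch of how a client might compute the `date` part of the composite `(event, date)` partition key from slide 21. The slide does not say how `date` is encoded, so the one-bucket-per-UTC-day integer (for example 20130626) is an assumption, and `day_bucket`, `partition_key` and `buckets_between` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def day_bucket(ts):
    """Encode the UTC day of ts as an integer such as 20130626 (assumed encoding)."""
    utc = ts.astimezone(timezone.utc)
    return utc.year * 10000 + utc.month * 100 + utc.day


def partition_key(event_id, ts):
    """Composite partition key: one row per event per day."""
    return (event_id, day_bucket(ts))


def buckets_between(start, end):
    """Every day bucket a time-range read has to visit, inclusive."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.year * 10000 + day.month * 100 + day.day)
        day += timedelta(days=1)
    return buckets


start = datetime(2013, 6, 26, 23, 30, tzinfo=timezone.utc)
end = datetime(2013, 6, 28, 0, 15, tzinfo=timezone.utc)
print(partition_key("some-event", start))
print(buckets_between(start, end))
```

The trade-off is that a range read spanning several days becomes one query per bucket, which pairs naturally with issuing those queries in parallel.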
  22. Plan Data Immutable
      Insert = Update = Delete
          UPDATE ... SET b = 'Y' WHERE id = 1
          INSERT INTO ... (id, c) VALUES (1, 'Z')
          DELETE d FROM ... WHERE id = 1
          SELECT * FROM ... WHERE id = 1
      [Diagram: row 1 with columns a, b, c, d = A, B, C, D; separate entries for row 1 holding Y and Z; resulting row 1 = A, Y, Z]
      The SELECT has to fetch 4 rows: slo-o-ow
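Why the read on slide 22 touches several entries can be shown with a toy model. This is a simplified sketch, assuming each write is stored as a separate fragment and a read keeps the newest value per column; `merge_fragments` and `TOMBSTONE` are hypothetical names, and the integer write times are invented.

```python
TOMBSTONE = object()  # marker left behind by a delete in this toy model


def merge_fragments(fragments):
    """Merge (write_time, {column: value}) fragments; the newest write per column wins."""
    newest = {}
    for write_time, cells in fragments:
        for column, value in cells.items():
            if column not in newest or write_time >= newest[column][0]:
                newest[column] = (write_time, value)
    # Deleted columns are dropped only after the merge.
    return {c: v for c, (_, v) in newest.items() if v is not TOMBSTONE}


fragments = [
    (1, {"a": "A", "b": "B", "c": "C", "d": "D"}),  # original insert
    (2, {"b": "Y"}),                                # UPDATE ... SET b = 'Y'
    (3, {"c": "Z"}),                                # INSERT ... (id, c)
    (4, {"d": TOMBSTONE}),                          # DELETE d
]
row = merge_fragments(fragments)
print(row, "from", len(fragments), "fragments")
```

Writing each row once and never touching it again keeps the read at a single fragment, which is the point of planning data as immutable.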
  23. Queue
      http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
      Queue:
          INSERT INTO ... (name, enqueued_at, payload) VALUES ('Q', now(), ...)
      Dequeue:
          DELETE payload FROM ... WHERE name = 'Q' AND enqueued_at = ...
      Pick the next:
          SELECT * FROM ... WHERE name = 'Q' LIMIT 1
      [Diagram: queue row Q with a run of entries marked * ahead of the unmarked ones]
      Have to fetch 4 rows: slo-o-ow
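The queue anti-pattern on slide 23 can be modelled in a few lines. This is a toy sketch under the assumption that dequeued entries stay in the row as deletion markers until they are cleaned up; `pick_next` is a hypothetical helper and `None` stands in for a deleted payload.

```python
def pick_next(partition):
    """Return (payload, entries_scanned) for the first live entry.

    partition: list of (enqueued_at, payload) sorted by enqueued_at,
    where payload is None for an entry that was already dequeued.
    """
    scanned = 0
    for _enqueued_at, payload in partition:
        scanned += 1
        if payload is not None:
            return payload, scanned
    return None, scanned


# 1000 already-dequeued entries sit in front of the only live one.
queue = [(t, None) for t in range(1000)] + [(1000, "job-1000")]
payload, scanned = pick_next(queue)
print(payload, "found after scanning", scanned, "entries")
```

Even with `LIMIT 1`, the work grows with the number of dequeued entries still in the row, not with the number of live ones.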
  24. How Many?
      Remember - eventual consistency.
      Concurrent updates => wrong count
          SELECT count(*) FROM ... WHERE ... ;
      Full scan over the selection => default 10,000 rows limit => wrong count
      Have an integer column and increment it
          CREATE TABLE count_table (
            id uuid,
            value counter,
            PRIMARY KEY (id)
          );
          ...
          UPDATE count_table SET value = value + 1 WHERE id = ... ;
      Counter column family:
      http://www.datastax.com/documentation/cassandra/1.2/cassandra/cql_using/use_counter_t.html
      [Slide labels: slo-o-ow, mess, mess]
  25. Blobs
      Whole blob in one column (row layout: id | data): OutOfMemory
          CREATE TABLE blob (
            id uuid,
            data blob,
            PRIMARY KEY (id)
          );
      Blob split into chunks (row layout: id | chunk_no | data):
          CREATE TABLE blob (
            id uuid,
            chunk_no int,
            data blob,
            PRIMARY KEY (id, chunk_no)
          );
      http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
      http://wiki.apache.org/cassandra/CassandraLimitations
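The chunked layout on slide 25 implies client-side splitting and reassembly. A minimal sketch, assuming the client does the chunking itself: `split_blob` and `join_chunks` are hypothetical helpers, and the 4-byte chunk size is only for the demo (a real value would be far larger).

```python
import uuid


def split_blob(blob_id, data, chunk_size):
    """Turn one blob into (id, chunk_no, data) rows for the chunked table."""
    return [
        (blob_id, chunk_no, data[offset:offset + chunk_size])
        for chunk_no, offset in enumerate(range(0, len(data), chunk_size))
    ]


def join_chunks(rows):
    """Reassemble a blob from its rows, whatever order they arrive in."""
    return b"".join(chunk for _, _, chunk in sorted(rows, key=lambda r: r[1]))


blob_id = uuid.uuid4()
payload = b"0123456789"
rows = split_blob(blob_id, payload, chunk_size=4)
for row in rows:
    print(row[1], row[2])
restored = join_chunks(reversed(rows))
```

Each row stays small, so neither the server nor the client has to hold the whole blob in memory for a single read or write.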
  26. Unbounded Queries
      [Diagram: Cassandra, result set, client; labels: OutOfMemory (twice), slo-o-ow]
      • C* clients still use RPC
      • Cursors: planned for 2.0. https://issues.apache.org/jira/browse/CASSANDRA-4415
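Until cursors arrive, a client can bound its own result sets by paging manually, as slide 26 suggests is needed. This is a sketch only: `fetch_page` is a hypothetical stand-in for a query of the shape `SELECT ... WHERE <partition> = ? AND <clustering key> > ? LIMIT ?`, run here against an in-memory list instead of a cluster.

```python
def fetch_page(rows, after_key, page_size):
    """Hypothetical stand-in for one bounded query over rows sorted by key."""
    matching = [r for r in rows if after_key is None or r[0] > after_key]
    return matching[:page_size]


def iter_rows(rows, page_size):
    """Yield all rows while holding at most one page in memory."""
    last_key = None
    while True:
        page = fetch_page(rows, last_key, page_size)
        if not page:
            return
        yield from page
        last_key = page[-1][0]  # resume after the last key seen


table = [(i, f"v{i}") for i in range(10)]  # (clustering key, value), sorted
for key, value in iter_rows(table, page_size=3):
    print(key, value)
```

The caller sees one continuous stream, while each underlying query returns at most `page_size` rows.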
  27. Column names are data... for now
      Keep them short.
  28. Limit Client to a node
      Why??
      [Diagram: Client; Node 1, Node 2, Node 3; CF; parts 2, 3]
      slo-o-ow, hot spots
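The hot-spot effect on slide 28 shows up in a simple count of where requests land. A toy sketch, assuming plain round-robin over a list of contact points: `route_requests` and the node names are hypothetical, and real drivers have their own load-balancing options that this does not reproduce.

```python
from collections import Counter
from itertools import cycle


def route_requests(nodes, request_count):
    """Send requests round-robin to the given nodes and count hits per node."""
    picker = cycle(nodes)
    return Counter(next(picker) for _ in range(request_count))


pinned = route_requests(["node1"], 300)                    # client limited to one node
spread = route_requests(["node1", "node2", "node3"], 300)  # client knows all nodes
print("pinned:", dict(pinned))
print("spread:", dict(spread))
```

With a single contact point that node coordinates every request; with all three, each handles a third.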
  29. Helpful Links
      • Sperasoft @ slideshare: http://www.slideshare.net/Sperasoft
      • Sperasoft @ speakerdeck: https://speakerdeck.com/sperasoft
      • Sperasoft @ github: https://github.com/Sperasoft/Workshop