Cassandra 2.0 and timeseries

At this meetup Patrick McFadin, Solutions Architect at DataStax, will be discussing the most recently added features in Apache Cassandra 2.0, including: Lightweight transactions, eager retries, improved compaction, triggers, and CQL cursors. He'll also be touching on time series data with Apache Cassandra.

Transcript

  • 1. ©2013 DataStax Confidential. Do not distribute without consent. @PatrickMcFadin Patrick McFadin Chief Evangelist/Solution Architect - DataStax Cassandra 2.0: Intro + Time Series Friday, October 11, 13
  • 2. Who I am • Patrick McFadin • Solution Architect at DataStax • Cassandra MVP • User for years • Follow me for more: I talk about Cassandra and building scalable, resilient apps ALL THE TIME! @PatrickMcFadin • Dude. Uptime == $$
  • 3. Cassandra - An introduction
  • 4. Cassandra - Intro • Based on the Amazon Dynamo and Google BigTable papers • Shared nothing • Data safe as possible • Predictable scaling
  • 5. Cassandra - More than one server • All nodes participate in a cluster • Shared nothing • Add or remove as needed • More capacity? Add a server • Each node owns a token • Tokens denote a range of keys • 4 nodes? -> Key range/4 • Each node owns 1/4 of the data
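
The token idea on slide 5 can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's partitioner: the 32-bit ring size, the MD5-based `token_for` helper, and the evenly spaced tokens are all assumptions made for the example.

```python
from bisect import bisect_left
from hashlib import md5

RING_SIZE = 2 ** 32  # hypothetical token space, chosen small for readability

def token_for(key):
    """Hash a key onto the ring (MD5 is only an illustrative choice)."""
    return int.from_bytes(md5(key.encode()).digest()[:4], "big")

def evenly_spaced_tokens(node_count):
    """Give each node one token so the ring splits into equal ranges."""
    return [(i + 1) * RING_SIZE // node_count - 1 for i in range(node_count)]

def owner_index(tokens, key):
    """Index of the node whose range contains the key's token."""
    # A node owns every token up to and including its own token.
    return bisect_left(tokens, token_for(key)) % len(tokens)

tokens = evenly_spaced_tokens(4)
counts = [0] * 4
for i in range(10_000):
    counts[owner_index(tokens, f"user-{i}")] += 1
print(counts)  # roughly a quarter of the keys per node
```

With four nodes, each one ends up owning about a quarter of the keys, which is the "Key range/4" point on the slide.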
  • 6. Cassandra - Locally Distributed • Client writes to any node • Node coordinates with others • Data replicated in parallel • Replication factor: How many copies of your data? • RF = 3 here • Each node stores 3/4 of the cluster's total data.
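
The 3/4 figure on slide 6 follows from RF = 3 on a 4-node ring. A minimal sketch, assuming equal-sized ranges and a placement rule that walks clockwise from the primary owner (the function names are hypothetical, and real placement strategies can differ):

```python
from fractions import Fraction

def replicas_for(primary, node_count, rf):
    """Walk clockwise from the primary owner, taking the next rf nodes."""
    return [(primary + step) % node_count for step in range(rf)]

def share_per_node(node_count, rf):
    """Fraction of the cluster's data each node stores, assuming equal ranges."""
    share = [Fraction(0)] * node_count
    for primary in range(node_count):
        for node in replicas_for(primary, node_count, rf):
            share[node] += Fraction(1, node_count)
    return share

print(replicas_for(3, node_count=4, rf=3))  # [3, 0, 1]
print(share_per_node(node_count=4, rf=3))   # every node holds 3/4
```

Each node is a replica for three of the four ranges, so it holds RF / N = 3/4 of the data.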
  • 7. Cassandra - Geographically Distributed • Client writes local • Data syncs across WAN • Replication Factor per DC • Single coordinator
  • 8. Cassandra - Consistency • Consistency Level (CL) • Client specifies per read or write • ALL = All replicas ack • QUORUM = A majority (more than 50%) of replicas ack • LOCAL_QUORUM = A majority of replicas in the local DC ack • ONE = Only one replica acks
  • 9. Cassandra - Transparent to the application • A single node failure shouldn’t bring failure • Replication Factor + Consistency Level = Success • This example: • RF = 3 • CL = QUORUM • More than 50% ack, so we are good!
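
The arithmetic behind slides 8 and 9 can be written out as a small sketch. The helper names are hypothetical, and LOCAL_QUORUM is left out because it needs a per-data-center replication factor:

```python
def required_acks(level, rf):
    """Replica acknowledgements needed for a consistency level."""
    return {"ALL": rf, "QUORUM": rf // 2 + 1, "ONE": 1}[level]

def survives(level, rf, nodes_down):
    """True if enough replicas remain to satisfy the consistency level."""
    return rf - nodes_down >= required_acks(level, rf)

print(required_acks("QUORUM", 3))             # 2
print(survives("QUORUM", rf=3, nodes_down=1)) # True
print(survives("ALL", rf=3, nodes_down=1))    # False
```

With RF = 3 and CL = QUORUM, two acknowledgements are enough, so losing one replica is invisible to the application.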
  • 10. Cassandra Applications - Drivers • DataStax Drivers for Cassandra • Java • C# • Python • more on the way
  • 11. Application Example - Layout • Active-Active • Service based DNS routing • Cassandra Replication
  • 12. Application Example - Uptime • Normal server maintenance • Application is unaware • Cassandra Replication
  • 13. Application Example - Failure • Data center failure • Data is safe. Route traffic. • Another happy user!
  • 14. Cassandra 2.0 - Big new features
  • 15. Five Years of Cassandra • Timeline: Jul-08, Jul-09, May-10, Feb-11, Dec-11, Oct-12, Jul-13 • Releases: 0.1, 0.3, 0.6, 0.7, 1.0, 1.2, ... 2.0 • DSE
  • 16. Lightweight transactions: the problem • It’s a Race! Who wins?
    Session 1:
    SELECT * FROM users WHERE username = 'jbellis';   [empty resultset]
    INSERT INTO users (username, password) VALUES ('jbellis', 'xdg44hh');
    Session 2:
    SELECT * FROM users WHERE username = 'jbellis';   [empty resultset]
    INSERT INTO users (username, password) VALUES ('jbellis', '8dhh43k');
  • 17. Why Locking Doesn’t Work • Client locks • Write times out • Lock released • Hint is replayed!! • Diagram labels: Client (locks), Coordinator, request, Replica, internal request
  • 18. Why Locking Doesn’t Work • Client locks • Write times out • Lock released • Hint is replayed!! • Diagram labels: Client (locks), Coordinator, request, Replica, internal request, X
  • 19. Why Locking Doesn’t Work • Client locks • Write times out • Lock released • Hint is replayed!! • Diagram labels: Client (locks), Coordinator, request, Replica, internal request, hint, X
  • 20. Why Locking Doesn’t Work • Client locks • Write times out • Lock released • Hint is replayed!! • Diagram labels: Client (locks), Coordinator, request, Replica, internal request, hint, timeout response, X
  • 21. Paxos • Consensus algorithm • All operations are quorum-based • Each replica sends information about unfinished operations to the leader during prepare • Paxos Made Simple
  • 22. LWT: details • 4 round trips vs 1 for normal updates • Paxos state is durable • Immediate consistency with no leader election or failover • ConsistencyLevel.SERIAL • http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
  • 23. LWT: Use with caution • Great for 1% of your application • Eventual consistency is your friend • http://www.slideshare.net/planetcassandra/c-summit-2013-eventual-consistency-hopeful-consistency-by-christos-kalantzis
  • 24. Using LWT • Don’t overwrite an existing record • Only update record if condition is met
    INSERT INTO USERS (username, email, ...) VALUES ('jbellis', 'jbellis@datastax.com', ... ) IF NOT EXISTS;
    UPDATE USERS SET email = 'jonathan@datastax.com', ... WHERE username = 'jbellis' IF email = 'jbellis@datastax.com';
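
The semantics of `IF NOT EXISTS` and `IF <condition>` can be modelled in memory to show how they settle the race from slide 16. This is only a single-process sketch: the `UsersTable` class is hypothetical, and its local lock merely stands in for the Paxos round that Cassandra runs across replicas.

```python
import threading

class UsersTable:
    """In-memory stand-in for a table with conditional writes."""

    def __init__(self):
        self._rows = {}
        # Assumption: a local lock replaces the distributed Paxos round.
        self._lock = threading.Lock()

    def insert_if_not_exists(self, username, **columns):
        """Insert the row only if no row with this key exists yet."""
        with self._lock:
            if username in self._rows:
                return False  # not applied
            self._rows[username] = dict(columns)
            return True

    def update_if(self, username, column, expected, new_value):
        """Update one column only if it currently holds the expected value."""
        with self._lock:
            row = self._rows.get(username)
            if row is None or row.get(column) != expected:
                return False  # not applied
            row[column] = new_value
            return True

    def get(self, username):
        with self._lock:
            return dict(self._rows.get(username, {}))

users = UsersTable()
first = users.insert_if_not_exists("jbellis", password="xdg44hh")
second = users.insert_if_not_exists("jbellis", password="8dhh43k")
print(first, second)  # True False
```

Exactly one of the two sessions sees its write applied; the other is told it lost and can react, instead of silently overwriting.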
  • 25. CQL Improvements • Cursors • Large result sets now have ->next() functionality • Prevents massive result sets OOMing • No more client side hacks with LIMIT • Warning: Not isolated
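
The cursor behaviour on slide 25 amounts to fetching one page at a time and continuing from a token. A minimal sketch of that loop, where `fetch_page` is a hypothetical stand-in for the server call and is not a driver API:

```python
def fetch_page(start, page_size, total_rows=10):
    """Hypothetical server call: one page of rows plus a continuation token."""
    begin = start or 0
    end = min(begin + page_size, total_rows)
    next_token = end if end < total_rows else None
    return list(range(begin, end)), next_token

def iterate_rows(page_size):
    """Yield rows one by one while holding only a single page in memory."""
    token = None
    while True:
        rows, token = fetch_page(token, page_size)
        yield from rows
        if token is None:
            return

print(list(iterate_rows(page_size=4)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because each page is a separate request, rows written between two fetches may or may not appear, which is the "Not isolated" warning on the slide.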
  • 26. CQL Improvements • ALTER DROP • Remove a field from a CQL table. • Conditional schema changes • Only execute if condition met
    CREATE KEYSPACE IF NOT EXISTS ks WITH replication = { 'class': 'SimpleStrategy', 'replication_factor' : 3 };
    CREATE TABLE IF NOT EXISTS test (k int PRIMARY KEY);
    DROP KEYSPACE IF EXISTS ks;
    ALTER TABLE users DROP address3;
  • 27. CQL Improvements • Aliases in SELECT • Limit and TTL in prepared statements
    SELECT event_id, dateOf(created_at) AS creation_date, blobAsText(content) AS content FROM timeline;
     event_id                | creation_date            | content
    -------------------------+--------------------------+----------------------
     550e8400-e29b-41d4-a716 | 2013-07-26 10:44:33+0200 | Something happened!?
    SELECT * FROM myTable LIMIT ?;
    UPDATE myTable USING TTL ? SET v = 2 WHERE k = 'foo';
  • 28. Triggers • Executed on the coordinator before mutation • Takes original mutation and adds any new • Jars deployed per server
    CREATE TRIGGER <name> ON <table> USING <classname>;
    DROP TRIGGER <name> ON [<keyspace>.]<table>;
  • 29. Trigger implementation • You have to implement your own ITrigger (for now) • Compile and deploy to each server
    class MyTrigger implements ITrigger {
        public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update) { ... }
    }
  • 30. Experimental! • Relies on internal RowMutation, ColumnFamily classes • Not sandboxed. Be careful! • Expect changes in 2.1
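
The "takes original mutation and adds any new" behaviour from slide 28 can be illustrated with a small model. This is not the Java ITrigger interface: the tuple-based mutations, `audit_trigger`, and `coordinator_write` are hypothetical names invented for the sketch.

```python
def audit_trigger(key, update):
    """Return extra mutations to apply alongside the original one."""
    return [("audit_log", key, {"changed_columns": sorted(update)})]

def coordinator_write(storage, table, key, update, triggers):
    """Run triggers before applying the write, then apply everything."""
    mutations = [(table, key, update)]
    for trigger in triggers:
        # Each trigger sees the original update and may add new mutations.
        mutations.extend(trigger(key, update))
    for target_table, target_key, columns in mutations:
        storage.setdefault(target_table, {}).setdefault(target_key, {}).update(columns)
    return mutations

storage = {}
applied = coordinator_write(
    storage, "users", "jbellis", {"email": "jbellis@datastax.com"}, [audit_trigger]
)
print(len(applied))  # 2
```

The original write and the trigger's extra write are applied together, which is why an unsandboxed trigger deserves the caution on slide 30.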
  • 31. Cassandra and Time Series
  • 32. Time Series - Taming the beast • Peter Higgs and Francois Englert. Nobel prize for Physics • Theorized the existence of the Higgs boson • Found using ATLAS • Data stored in P-BEAST • Time series running on Cassandra
  • 33. Use Cassandra for time series
  • 34. Use Cassandra for time series • Get a Nobel prize
  • 35. Time Series - Why • Storage model from BigTable is perfect • One row key and tons of (variable) columns • Single layout on disk • Diagram: Row Key | Column Name: Column Value | Column Name: Column Value
  • 36. Time Series - Example • Storing weather data • One weather station • Temperature measurements every minute • Diagram: WeatherStation ID | 2013-10-09 10:00 AM: 72 Degrees | 2013-10-09 10:00 AM: 72 Degrees | 2013-10-10 11:00 AM: 65 Degrees
  • 37. Time Series - Example • Query data • Weather Station ID = Locality of single node • Diagram: WeatherStation ID 100 | 2013-10-09 10:00 AM: 72 Degrees | 2013-10-09 10:00 AM: 72 Degrees | 2013-10-10 11:00 AM: 65 Degrees • Date query: weatherStationID = 100 AND date = 2013-10-09 10:00 AM • OR • Date range: weatherStationID = 100 AND date > 2013-10-09 10:00 AM AND date < 2013-10-10 11:01 AM
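
The storage model from slides 35 to 37 (one row key, many time-ordered columns) can be mimicked with a sorted list. The `WideRow` class is a hypothetical in-memory model, and the 10:01 AM reading is an assumed extra data point added so the range query returns more than one column.

```python
from bisect import bisect_left, bisect_right, insort
from datetime import datetime

class WideRow:
    """One row key holding columns sorted by timestamp."""

    def __init__(self):
        self._times = []   # column names, kept sorted
        self._values = {}  # column name -> column value

    def insert(self, event_time, value):
        if event_time not in self._values:
            insort(self._times, event_time)
        self._values[event_time] = value

    def get(self, event_time):
        """Exact date query."""
        return self._values.get(event_time)

    def slice(self, after, before):
        """Date range query: after < event_time < before, in time order."""
        lo = bisect_right(self._times, after)
        hi = bisect_left(self._times, before)
        return [(t, self._values[t]) for t in self._times[lo:hi]]

stations = {100: WideRow()}
stations[100].insert(datetime(2013, 10, 9, 10, 0), "72 Degrees")
stations[100].insert(datetime(2013, 10, 9, 10, 1), "72 Degrees")  # assumed reading
stations[100].insert(datetime(2013, 10, 10, 11, 0), "65 Degrees")

in_range = stations[100].slice(
    datetime(2013, 10, 9, 10, 0), datetime(2013, 10, 10, 11, 1)
)
print([value for _, value in in_range])  # ['72 Degrees', '65 Degrees']
```

Because the columns are already sorted, a range query is a contiguous slice of one row, which is why the layout suits time series.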
  • 38. Time Series - How • CQL expresses this well • Data partitioned by weather station ID and time • Easy to insert data • Easy to query
    CREATE TABLE temperature ( weatherstation_id text, event_time timestamp, temperature text, PRIMARY KEY (weatherstation_id, event_time) );
    INSERT INTO temperature (weatherstation_id, event_time, temperature) VALUES ('1234ABCD', '2013-04-03 07:01:00', '72F');
    SELECT temperature FROM temperature WHERE weatherstation_id = '1234ABCD' AND event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';
  • 39. Time Series - Further partitioning • Writing every minute, a single row will eventually run out of columns • 2 billion columns per storage row • Data partitioned by weather station ID and time • Use the partition key to split things up
    CREATE TABLE temperature_by_day ( weatherstation_id text, date text, event_time timestamp, temperature text, PRIMARY KEY ((weatherstation_id, date), event_time) );
  • 40. Time Series - Further partitioning • Still easy to insert • Still easy to query
    INSERT INTO temperature_by_day (weatherstation_id, date, event_time, temperature) VALUES ('1234ABCD', '2013-04-03', '2013-04-03 07:01:00', '72F');
    SELECT temperature FROM temperature_by_day WHERE weatherstation_id = '1234ABCD' AND date = '2013-04-03' AND event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';
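
With the `(weatherstation_id, date)` partition key, the application has to derive the date bucket for every write and list the buckets a time range touches. A minimal sketch of that client-side step, with hypothetical helper names:

```python
from datetime import datetime, timedelta

READINGS_PER_DAY = 24 * 60  # one reading per minute gives 1,440 columns per partition

def partition_key(station_id, event_time):
    """Build the (weatherstation_id, date) pair used as the partition key."""
    return (station_id, event_time.strftime("%Y-%m-%d"))

def partitions_for_range(station_id, start, end):
    """List every daily partition that a time range touches."""
    keys = []
    day = start.date()
    while day <= end.date():
        keys.append((station_id, day.isoformat()))
        day += timedelta(days=1)
    return keys

print(partition_key("1234ABCD", datetime(2013, 4, 3, 7, 1)))
# ('1234ABCD', '2013-04-03')
print(partitions_for_range(
    "1234ABCD", datetime(2013, 4, 3, 23, 50), datetime(2013, 4, 5, 0, 10)
))
```

A query that spans several days then becomes one SELECT per listed partition, which keeps every partition bounded at a day's worth of readings.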
  • 41. Time Series - Use cases • Logging • Thing Tracking (IoT) • Sensor Data • User Tracking • Fraud Detection • Nobel prizes!
  • 42. Thank you! • Apache Cassandra 2.0 - Data model on fire • Next talk in my data model series!