Cassandra: An Alien Technology That’s not so Alien
Who am I?
• Brian Hess
– Sr. Product Manger for Analytics
• Math and CS background
– Distributed Systems
– Algorithms
• Government data mining research
• Data warehousing (Netezza)
– SQL and data mining and UDF fun
• Joined Datastax 1.5 years ago
– New to NoSQL and Cassandra
© 2015. All Rights Reserved. 2
© 2015. All Rights Reserved. 3
© 2015. All Rights Reserved. 4
Agenda
The Good
• Query Language
• Tooling
• Conceptual Data
Model
The Different
• Application
Methodology
• Connections/Drivers
• Data Modeling
The Dangerous
• Batches
• Secondary Indices
• Lightweight
Transactions
© 2015. All Rights Reserved. 5
Cassandra Query Language
• Very much like SQL
– SELECT, INSERT, DELETE
– WHERE clauses
– CREATE TABLE, GRANT, etc, etc
• SQL syntax for Cassandra operations
– Only supports Cassandra operations
– Not trying to cover the full SQL space
• Missing several things:
– JOIN, GROUP BY, windowed aggregates, subqueries, WITH, etc
© 2015. All Rights Reserved. 6
Cassandra Query Language
• Pop Quiz: CQL or SQL:
1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)
© 2015. All Rights Reserved. 7
Cassandra Query Language
• Pop Quiz: CQL or SQL:
1. SELECT a, b, c FROM myData;
2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3;
3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5;
4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3;
5. SELECT MAX(temp) FROM txn
6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON
(a.id=b.id)
© 2015. All Rights Reserved. 8
Tabular Data Model
• Rows and Columns
– Like SQL, R data.frame, etc
• Strong schema
– Each column has a data type
– Custom data types including Map, Set, List, and UDTs
– Can mimic “old style thrift tables” with Map
• Most Cassandra tables really have a schema
– Most data really has a schema
© 2015. All Rights Reserved. 9
Tooling
• Cqlsh
– Command-line CQL interpreter
– Mainly used for management operations
• CREATE KEYSPACE, CREATE TABLE, etc
• GRANT, etc
– Singleton INSERTs
– Some light load/unload operations via COPY
© 2015. All Rights Reserved. 10
Tooling
• DevCenter
– DataStax tool
– “Toad for Cassandra”
© 2015. All Rights Reserved. 11
Data Modeling
• Start with the query
– How will you access this data
– Then create the schema optimized for those queries
– Store it multiple times if you must – “Query Tables”
• Uniqueness – what is an overwrite versus an insert?
– Partition keys – for fast lookups
– Clustering Columns – for ranges, etc
– Mix and match for different query patterns – Users by ID, Users by Email, etc
• No joins
– So, no star schema
– Denormalize!
© 2015. All Rights Reserved. 12
Application Methodology
• Instead of rolling back, we retry
• In SQL you try a transaction, and if error, then rollback
– Leverages transaction isolation and rollback
• In Cassandra you try, and if error, try again
– Leverages idempotent data model and high availability
• “Well, did you want to write that to the database or not?”
– “Instead of Transactions and Rollback, we have Idempotency and Retry”
© 2015. All Rights Reserved. 13
Drivers / Connections
• Multilanguage drivers
– Java, Python, C/C++, PHP, Ruby, etc
• Leverage the whole cluster
– Connect to all the nodes – connect to one and discovering the others
– Load balancing – Round Robin, etc
– Smart routing of queries – “Token Aware Routing”
• Not JDBC / ODBC
– Cassandra-specific, but similar
– Cluster, Session, Statement, ResultSet, etc
– Different data types, etc
– Lower-level configurations – number of connections, paging size, etc
© 2015. All Rights Reserved. 14
Batches
• These are not what you think they are
• How they work…
– Send all the statements to the coordinator
– If “logged”, then a batchlog is written for durability – and replicates it
– Executes each statement – involves other nodes (possibly)
• They are not done as an isolated set of INSERTs
– Well, they are if they all update the same partition (nuance)
– Things will be seen as they are executed
• They do not speed things up
– The coordinator has to do all the work – latency increase and timeouts
– You still need to talk to multiple nodes for multiple INSERTs
– Logged batches require a lot of work by the coordinator
• Maintain a batchlog for durability – and replicates it
• But, they do serve a purpose
– Inter-table consistency (logged batches)
– Some bulk load performance (unlogged batches)
© 2015. All Rights Reserved. 15
Secondary Indices
• These are not what you think they are
• Secondary Indices need to consult every Cassandra node
– Cassandra is optimized for “single node queries”
– Okay for single-partition queries, but not really worth it
• Maintaining secondary indices is also an overhead
• Basically, they rarely make things faster in cases that matter
• Consider Materialized Views in Cassandra 3.0
© 2015. All Rights Reserved. 16
Lightweight Transactions
• These are not what you think they are
• They are not SQL transactions
– They do not support roll-back, etc
• They do allow for serialization and isolation
– Useful for operations like creating accounts
– INSERT INTO users(username, email) VALUES
('cassandra_fan', 'abc@def.com') IF NOT EXISTS
• They are costly
– Paxos algorithm – multiple passes/rounds
• But they do serve a purpose
– Use sparingly
© 2015. All Rights Reserved. 17
Thank you
Cassandra?
© 2015. All Rights Reserved. 19
Cassandra!
© 2015. All Rights Reserved. 20

Cassandra: An Alien Technology That's not so Alien

  • 1.
    Cassandra: An AlienTechnology That’s not so Alien
  • 2.
    Who am I? •Brian Hess – Sr. Product Manger for Analytics • Math and CS background – Distributed Systems – Algorithms • Government data mining research • Data warehousing (Netezza) – SQL and data mining and UDF fun • Joined Datastax 1.5 years ago – New to NoSQL and Cassandra © 2015. All Rights Reserved. 2
  • 3.
    © 2015. AllRights Reserved. 3
  • 4.
    © 2015. AllRights Reserved. 4
  • 5.
    Agenda The Good • QueryLanguage • Tooling • Conceptual Data Model The Different • Application Methodology • Connections/Drivers • Data Modeling The Dangerous • Batches • Secondary Indices • Lightweight Transactions © 2015. All Rights Reserved. 5
  • 6.
    Cassandra Query Language •Very much like SQL – SELECT, INSERT, DELETE – WHERE clauses – CREATE TABLE, GRANT, etc, etc • SQL syntax for Cassandra operations – Only supports Cassandra operations – Not trying to cover the full SQL space • Missing several things: – JOIN, GROUP BY, windowed aggregates, subqueries, WITH, etc © 2015. All Rights Reserved. 6
  • 7.
    Cassandra Query Language •Pop Quiz: CQL or SQL: 1. SELECT a, b, c FROM myData; 2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3; 3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5; 4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3; 5. SELECT MAX(temp) FROM txn 6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON (a.id=b.id) © 2015. All Rights Reserved. 7
  • 8.
    Cassandra Query Language •Pop Quiz: CQL or SQL: 1. SELECT a, b, c FROM myData; 2. SELECT a, b, c FROM myData ORDER BY a LIMIT 3; 3. SELECT cust_id, txn_id FROM txn WHERE cust_id=5; 4. SELECT sensor, MAX(temp) FROM txn WHERE sensor=3; 5. SELECT MAX(temp) FROM txn 6. SELECT a.id, a.info, b.moreinfo FROM a JOIN b ON (a.id=b.id) © 2015. All Rights Reserved. 8
  • 9.
    Tabular Data Model •Rows and Columns – Like SQL, R data.frame, etc • Strong schema – Each column has a data type – Custom data types including Map, Set, List, and UDTs – Can mimic “old style thrift tables” with Map • Most Cassandra tables really have a schema – Most data really has a schema © 2015. All Rights Reserved. 9
  • 10.
    Tooling • Cqlsh – Command-lineCQL interpreter – Mainly used for management operations • CREATE KEYSPACE, CREATE TABLE, etc • GRANT, etc – Singleton INSERTs – Some light load/unload operations via COPY © 2015. All Rights Reserved. 10
  • 11.
    Tooling • DevCenter – DataStaxtool – “Toad for Cassandra” © 2015. All Rights Reserved. 11
  • 12.
    Data Modeling • Startwith the query – How will you access this data – Then create the schema optimized for those queries – Store it multiple times if you must – “Query Tables” • Uniqueness – what is an overwrite versus an insert? – Partition keys – for fast lookups – Clustering Columns – for ranges, etc – Mix and match for different query patterns – Users by ID, Users by Email, etc • No joins – So, no star schema – Denormalize! © 2015. All Rights Reserved. 12
  • 13.
    Application Methodology • Insteadof rolling back, we retry • In SQL you try a transaction, and if error, then rollback – Leverages transaction isolation and rollback • In Cassandra you try, and if error, try again – Leverages idempotent data model and high availability • “Well, did you want to write that to the database or not?” – “Instead of Transactions and Rollback, we have Idempotency and Retry” © 2015. All Rights Reserved. 13
  • 14.
    Drivers / Connections •Multilanguage drivers – Java, Python, C/C++, PHP, Ruby, etc • Leverage the whole cluster – Connect to all the nodes – connect to one and discovering the others – Load balancing – Round Robin, etc – Smart routing of queries – “Token Aware Routing” • Not JDBC / ODBC – Cassandra-specific, but similar – Cluster, Session, Statement, ResultSet, etc – Different data types, etc – Lower-level configurations – number of connections, paging size, etc © 2015. All Rights Reserved. 14
  • 15.
    Batches • These arenot what you think they are • How they work… – Send all the statements to the coordinator – If “logged”, then a batchlog is written for durability – and replicates it – Executes each statement – involves other nodes (possibly) • They are not done as an isolated set of INSERTs – Well, they are if they all update the same partition (nuance) – Things will be seen as they are executed • They do not speed things up – The coordinator has to do all the work – latency increase and timeouts – You still need to talk to multiple nodes for multiple INSERTs – Logged batches require a lot of work by the coordinator • Maintain a batchlog for durability – and replicates it • But, they do serve a purpose – Inter-table consistency (logged batches) – Some bulk load performance (unlogged batches) © 2015. All Rights Reserved. 15
  • 16.
    Secondary Indices • Theseare not what you think they are • Secondary Indices need to consult every Cassandra node – Cassandra is optimized for “single node queries” – Okay for single-partition queries, but not really worth it • Maintaining secondary indices is also an overhead • Basically, they rarely make things faster in cases that matter • Consider Materialized Views in Cassandra 3.0 © 2015. All Rights Reserved. 16
  • 17.
    Lightweight Transactions • Theseare not what you think they are • They are not SQL transactions – They do not support roll-back, etc • They do allow for serialization and isolation – Useful for operations like creating accounts – INSERT INTO users(username, email) VALUES ('cassandra_fan', 'abc@def.com') IF NOT EXISTS • They are costly – Paxos algorithm – multiple passes/rounds • But they do serve a purpose – Use sparingly © 2015. All Rights Reserved. 17
  • 18.
  • 19.
    Cassandra? © 2015. AllRights Reserved. 19
  • 20.
    Cassandra! © 2015. AllRights Reserved. 20