Cassandra

Published in: Technology, Business

  1. Robert Koletka
  2. What is Cassandra
     ● Basically a key-value store
       ● With some extras.
     ● It is a NoSQL database that is
       ● Decentralized: no single point of failure
       ● Elastic: linear scalability
       ● Fault tolerant: replication
       ● Optimized for writes, though reads don't do badly at all.
  3. What is Cassandra
     ● Based on two papers
       ● Bigtable: Google
       ● Dynamo: Amazon
     ● Dynamo partitioning and replication
     ● Bigtable data model
     ● CAP Theorem
       ● Consistent: NO (eventually consistent by default, tunable per operation)
       ● Available: YES
       ● Partition Tolerant: YES
  4. What is Cassandra
     ● Uses consistent hashing
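
The consistent hashing mentioned on this slide can be sketched as a ring of node tokens. This is a minimal illustration, not Cassandra's implementation: the `token` function, the `HashRing` class, and the node names are hypothetical, and each node gets a single token derived from an MD5 hash of its name.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical single-token-per-node ring, for illustration only."""

    def __init__(self, nodes):
        # Each node owns the arc of the ring that ends at its token.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, row_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, token(row_key)) % len(self._ring)
        return self._ring[i][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner_before = {k: ring.node_for(k) for k in ("Rk", "Js", "Ac", "Bb")}

# Adding a node only moves keys that fall on the new node's arc.
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k, n in owner_before.items() if bigger.node_for(k) != n]
```

Any key that changes owner after the ring grows ends up on the new node; the other keys stay where they were, which is what makes adding capacity cheap.
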
  5. Data Model
     ● Cluster
     ● Keyspace: like a DB
     ● Column Families: like a Table
     ● Super Columns (optional)
     ● Columns
     ● Values
  6. Data Model
     ● Keyspace groups column families together
     ● Column Family groups data together
     ● Example:
       ● User Keyspace has
         – UserProfiles Column Family
         – Friends Column Family
  7. Data Model
     ● Cassandra doesn't require schemas like traditional DBs
     ● UserProfiles Example
       ● Rk = {Name:Robert, Surname:Koletka, Gender:Male}
       ● Js = {Name:John, Surname:Smith, Location:WC}
     ● Rk & Js are both valid entries in the UserProfiles Column Family even though they have different columns.
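
The schemaless rows on this slide can be modeled in plain Python as a dict of dicts. This is only a mental model of the data layout, not a client API; `user_profiles` and `get_column` are hypothetical names.

```python
# A column family sketched as a dict of rows; each row is its own dict of columns.
user_profiles = {
    "Rk": {"Name": "Robert", "Surname": "Koletka", "Gender": "Male"},
    "Js": {"Name": "John", "Surname": "Smith", "Location": "WC"},
}


def get_column(column_family, row_key, column_name, default=None):
    """Return one column value, or `default` if the row or column is absent."""
    return column_family.get(row_key, {}).get(column_name, default)


# Rows in the same column family need not share column names.
shared = set(user_profiles["Rk"]) & set(user_profiles["Js"])
only_rk = set(user_profiles["Rk"]) - set(user_profiles["Js"])
```

A reader of such rows has to tolerate missing columns, which is why `get_column` takes a default.
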
  8. Data Model
     ● Think about QUERIES, not about normalizing data.
     ● Use Case: "I want to get friends' names and surnames for a given UserID"
     ● Name & Surname need to be in the Friends column family.
     ● Js = {Ac:{Name:Alice,Surname:Cook}, Bb:{Name:Betty,Surname:Blah}} (Super)
     ● Js = {Ac:"Alice Cook", Bb:"Betty Blah"}
  9. Data Model
     ● Column
       ● Rowkey = {ColumnName:Value, CN:V, CN:V}
     ● Super Column
       ● Rowkey = {SuperColumnName:{CN:V, CN:V}, SCN:{CN:V, CN:V}}
     ● Super Columns group columns together
     ● Cannot index on a sub-column.
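
The two row shapes can be compared side by side with nested dicts. The sketch below reuses the friends example with illustrative values; the variable and function names are hypothetical, and sorting by column name mirrors the fact that columns are stored in sorted order.

```python
# Super column family: row key -> super column name -> {column name: value}
friends_super = {
    "Js": {
        "Ac": {"Name": "Alice", "Surname": "Cook"},
        "Bb": {"Name": "Betty", "Surname": "Blah"},
    }
}

# Standard column family: row key -> {column name: value}
friends_standard = {"Js": {"Ac": "Alice Cook", "Bb": "Betty Blah"}}


def friend_names_super(column_family, user_id):
    """Build full names from the sub-columns of each super column."""
    row = column_family.get(user_id, {})
    return [f"{sub['Name']} {sub['Surname']}" for _, sub in sorted(row.items())]


def friend_names_standard(column_family, user_id):
    """Full names are already stored as the column values."""
    row = column_family.get(user_id, {})
    return [value for _, value in sorted(row.items())]
```

Both layouts answer the same query with a single row lookup; the standard layout trades structure for a simpler read.
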
  10. Define Keyspace
     ● create keyspace <keyspace> with <att1>=<value1> and <att2>=<value2> ...;
     ● create keyspace UserKeyspace with placement_strategy = org.apache.cassandra.locator.SimpleStrategy and strategy_options = {replication_factor:2};
     ● SimpleStrategy: places each replica on the next node in the ring
     ● NetworkTopologyStrategy: for multiple data centers
     ● OldNetworkTopologyStrategy: different data centers and different racks
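
The "next node" placement rule described for SimpleStrategy can be sketched as a clockwise walk around the ring. This is a simplified illustration that ignores racks and data centers; `simple_strategy_replicas` and its parameters are hypothetical names, not part of Cassandra.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Return the primary node plus the following nodes on the ring.

    ring_nodes: node names in ring (token) order.
    primary_index: position of the node that owns the row key.
    """
    node_count = len(ring_nodes)
    # A cluster cannot hold more distinct replicas than it has nodes.
    replica_count = min(replication_factor, node_count)
    return [ring_nodes[(primary_index + i) % node_count] for i in range(replica_count)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
replicas = simple_strategy_replicas(nodes, primary_index=3, replication_factor=2)
```

With `replication_factor:2` as in the keyspace definition above, a key owned by the last node on the ring wraps around and is also stored on the first node.
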
  11. Define Column Family
     ● create column family <name> with <att1>=<value1> and <att2>=<value2>...;
     ● create column family UserProfiles with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type and column_metadata=[{column_name:Location, validation_class:UTF8Type, index_type:KEYS}];
  12. Define Column Family
     ● comparator = validates column names and determines how they are compared (sorted)
     ● default_validation_class = validation for values in columns that are not listed in column_metadata
     ● key_validation_class = validation for row keys
     ● Default is BytesType
  13. Define Column Family
     ● Other Available Types
       ● AsciiType
       ● BytesType
       ● CounterColumnType (distributed counter column; a CF contains either only counters or none at all)
       ● Int32Type
       ● IntegerType (a generic variable-length integer type)
       ● LexicalUUIDType
       ● LongType
       ● UTF8Type
  14. Define Column Family
     ● Many more options
       ● bloom_filter_fp_chance: target false-positive rate for bloom filters
       ● gc_grace: how long to wait before garbage-collecting deleted data
       ● keys_cached
       ● row_cache_save_period
       ● max_compaction_threshold
       ● ...
  15. Reads and Writes
     ● Cassandra is optimized for writes
       ● First written to a commitlog
       ● Then to an in-memory table (memtable)
       ● Then periodically written to disk (SSTable)
     ● Reads
       ● Read from all SSTables and memtables
       ● Bloom filters are used to speed up SSTable lookups
     ● Compaction
       ● Periodically Cassandra merges SSTables
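
The write path, read path, and compaction steps above can be sketched with an in-memory toy. Everything here is an assumption made for illustration: `TinyStore` is a hypothetical class, "disk" is a list of dicts, newer tables win by position rather than by timestamp, and bloom filters are left out.

```python
class TinyStore:
    """Toy commitlog -> memtable -> SSTable pipeline (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.commitlog = []   # append-only record of every write
        self.memtable = {}    # in-memory table holding the newest data
        self.sstables = []    # flushed, immutable tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commitlog.append((key, value))  # 1. durable log first
        self.memtable[key] = value           # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                     # 3. periodically to "disk"

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Check the memtable, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def compact(self):
        # Merge all SSTables into one; later (newer) tables overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [merged] if merged else []


store = TinyStore(memtable_limit=2)
store.write("Rk", "v1")
store.write("Js", "v1")   # memtable full, flushed to the first SSTable
store.write("Rk", "v2")
store.write("Ac", "v1")   # flushed to the second SSTable
tables_before = len(store.sstables)
store.compact()
```

Writes never touch existing tables, which is why they are cheap; reads may have to consult several tables until compaction merges them.
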
  16. Indexes
     ● Row Keys
       ● Cassandra keeps an index of its row keys
     ● Column Indexes
       ● Known as Secondary Indexes; they build an index on column values.
       ● Existing data is indexed in the background
       ● Query by using equality predicates
         – Then additional filters
  17. Indexes
     ● Get userprofiles where location = WC
     ● Get userprofiles where location = WC and age > 18
     ● NOT (there is no equality predicate on an indexed column)
       ● Get userprofiles where age > 18
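
The rule illustrated by these queries (one equality predicate on an indexed column first, then additional filters) can be sketched as follows. `IndexedColumnFamily` and its methods are hypothetical names for an in-memory model, not a Cassandra client API, and the sample rows are made up for the example.

```python
import operator
from collections import defaultdict

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


class IndexedColumnFamily:
    """Toy column family with secondary indexes on selected columns."""

    def __init__(self, indexed_columns):
        self.rows = {}
        self.indexes = {c: defaultdict(set) for c in indexed_columns}

    def insert(self, row_key, columns):
        # Sketch only: overwriting a row does not clean up old index entries.
        self.rows[row_key] = columns
        for column, index in self.indexes.items():
            if column in columns:
                index[columns[column]].add(row_key)

    def get_where(self, predicates):
        """predicates: list of (column, op, value) tuples, all ANDed together."""
        equalities = [(c, v) for c, op, v in predicates
                      if op == "=" and c in self.indexes]
        if not equalities:
            raise ValueError("need an equality predicate on an indexed column")
        column, value = equalities[0]
        candidates = self.indexes[column].get(value, set())
        # Apply every predicate as a filter over the index candidates.
        return [key for key in sorted(candidates)
                if all(c in self.rows[key] and OPS[op](self.rows[key][c], v)
                       for c, op, v in predicates)]


profiles = IndexedColumnFamily(indexed_columns=["location"])
profiles.insert("Js", {"location": "WC", "age": 30})
profiles.insert("Ab", {"location": "WC", "age": 16})
profiles.insert("Rk", {"location": "GP", "age": 25})
adults_in_wc = profiles.get_where([("location", "=", "WC"), ("age", ">", 18)])
```

A query with only `age > 18` raises an error in this model, mirroring the "NOT" case on the slide.
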
  18. Consistency
     ● Allows for configurable consistency settings
       ● Read: One, Quorum ((Replication Factor / 2) + 1), Local/Each_Quorum (Data Centers), All.
       ● Write: Any, One, Quorum, Local/Each_Quorum (Data Centers), All.
     ● Any means a write can be accepted by the coordinator while the replicas are down; the coordinator holds the data until the replicas come back up.
     ● Quorum gives some consistency while tolerating some failures.
     ● All requires every replica to be up.
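
The quorum formula above uses integer division. A small sketch of the arithmetic, with hypothetical helper names; the overlap check is the general rule that a read sees the latest write when the read and write replica counts together exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """(Replication Factor / 2) + 1, using integer division."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)                       # replicas needed for Quorum
tolerated_failures = rf - q          # replicas that may be down
```

With a replication factor of 3, Quorum needs 2 replicas and tolerates 1 failure; reading and writing at One with the same replication factor does not guarantee overlap.
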
  19. Consistency
     ● Read
       ● At least one replica must be up to read from.
       ● Reads from a number of replicas and returns the latest data, based on timestamp.
       ● Read repair keeps data consistent by updating out-of-date nodes with the latest data. It runs in the background.
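
Timestamp-based resolution and read repair can be sketched over a list of replica dicts. This is a simplified, synchronous model with hypothetical names (`read_with_repair`, `replicas`); as the slide notes, the real repair runs in the background.

```python
def read_with_repair(replicas, key):
    """Return the newest value for `key` and update stale replicas.

    replicas: list of dicts mapping key -> (value, timestamp).
    """
    responses = [replica[key] for replica in replicas if key in replica]
    if not responses:
        return None
    latest = max(responses, key=lambda value_ts: value_ts[1])
    # Read repair: push the newest version to replicas that are behind.
    for replica in replicas:
        if replica.get(key, (None, -1))[1] < latest[1]:
            replica[key] = latest
    return latest[0]


replicas = [
    {"Rk": ("old status", 1)},
    {"Rk": ("new status", 2)},
    {},  # this replica missed the write entirely
]
value = read_with_repair(replicas, "Rk")
```

After the call, all three replicas hold the version with the highest timestamp.
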
  20. Cassandra Query Language
     ● Allows for
       ● Select
         – SELECT [FIRST N] [REVERSED] <SELECT EXPR> FROM <COLUMN FAMILY> [USING <CONSISTENCY>] [WHERE <CLAUSE>] [LIMIT N];
         – SELECT [FIRST N] [REVERSED] name1..nameN FROM
         – Unlike SQL, there is no guarantee that the requested columns will be returned
         – SELECT ... WHERE KEY >= startkey AND KEY <= endkey AND name1 = value1
       ● Insert
       ● Delete
       ● Update
       ● Batch
       ● Truncate
       ● Create Keyspace
       ● Create Column Family
       ● Create Index
       ● Drop
  21. Other Stuff
     ● Cassandra stores columns in sorted order
       ● Allows you to get the first or last X columns
       ● Potentially store historical user data
     ● A single column cannot hold more than 2 GB
     ● Max number of columns per row is 2 billion
     ● Key and column names must be less than 64 KB
     ● Most languages have client libraries (Python, Java, Scala, Node.js, PHP, C++...)
     ● Try not to use raw Thrift.
  22. Last Example
     ● User Statuses
     ● Columns are stored in sorted order... use the timestamp as the column name
     ● Rk = {1:Good morning all, 2:lunch was good, 3:time to get drunk, 4:so many regrets from last night}
     ● create column family UserStatuses with comparator = LongType and key_validation_class=UTF8Type and default_validation_class=UTF8Type;
     ● Get the last X columns, get the first X columns
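
The status example relies on columns staying sorted by name. A minimal sketch of one such row, assuming integer timestamps as column names (matching the LongType comparator); `StatusRow` and its methods are hypothetical and not part of any client library.

```python
import bisect


class StatusRow:
    """One row whose column names (timestamps) are kept in sorted order."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> status text

    def insert(self, timestamp, text):
        if timestamp not in self._values:
            bisect.insort(self._names, timestamp)
        self._values[timestamp] = text

    def first(self, n):
        """Oldest n statuses, oldest first."""
        return [(t, self._values[t]) for t in self._names[:n]]

    def last(self, n):
        """Newest n statuses, newest first (like a reversed slice)."""
        names = self._names[-n:] if n > 0 else []
        return [(t, self._values[t]) for t in reversed(names)]


rk = StatusRow()
# Insertion order does not matter; the row stays sorted by timestamp.
rk.insert(3, "time to get drunk")
rk.insert(1, "Good morning all")
rk.insert(4, "so many regrets from last night")
rk.insert(2, "lunch was good")
latest_two = rk.last(2)
oldest_two = rk.first(2)
```

Because the row is already ordered, "latest X statuses" is a slice rather than a sort at read time.
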
