Datastax / Cassandra Modeling Strategies

  1. 1. DataStax / Cassandra Data Modeling Strategies. Avoiding The Three Stooges: Wide Partitions, Tombstones, Data Skew. Rahul Xavier Singh, Anant Corporation
  2. 2. TOC: Core Concepts, Wide Partitions, Data Modeling, Synthetic Sharding, Key Design, Tombstones, Data Skew, Avoid Tombstones
  3. 3. Business Platform Success We build business success platforms, which are collections of systems that serve business processes that have information needs for people.
  4. 4. Platform Thinking
  5. 5. How? Project Information Client Service Information Corporate Guides Collaborative Documents Assets & Files Corporate Assets Business Platform
      ● Curate framework of systems.
      ● Work with a vetted team of experts.
      ● Connect it all together.
      ● Focus on finding, analyzing, and acting on knowledge & communication towards business success.
  6. 6. Streamline. Organize. Unify. Business Platform
  7. 7. Who we help Succeed
  8. 8. Cassandra / DataStax Core Concepts
  9. 9. Cassandra Architecture: Cluster / Data Centers
      Cassandra is not for tiny data. Do you NEED:
      1. Fast reads and writes of terabytes of data?
      2. Replication / availability around the world?
      3. To never go down, always up?
      Don’t use Cassandra if:
      1. You have only gigabytes of data.
      2. Your application can chill in one datacenter.
      3. Your system can go down whenever it wants.
      4. You just want to be cool.
  10. 10. Cassandra Data Model: Keyspaces & Tables
      Cassandra tables / column families look like SQL Server / MySQL / Postgres tables & databases. They are not.
      1. CQL supports queries by partition key and, optionally, clustering key.
      2. CQL does not support arbitrary queries on columns.
      3. Cassandra shouldn’t be managing more than 100-150 tables across any number of keyspaces.
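To make the "they are not SQL tables" point concrete, here is a minimal pure-Python sketch (not Cassandra code) of a table as a map from partition key to rows sorted by clustering key. The class `ToyTable` and the sensor data are hypothetical, invented for illustration; the only idea taken from the slide is that lookups go through the partition key, optionally narrowed by the clustering key.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        # partition key -> sorted list of (clustering key, value)
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        i = bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)  # same full key: overwrite (upsert)
        else:
            rows.insert(i, (clustering_key, value))

    def select(self, partition_key, ck_from=None, ck_to=None):
        """Look up one partition, optionally narrowed to a clustering-key range."""
        rows = self._partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        lo = 0 if ck_from is None else bisect_left(keys, ck_from)
        hi = len(rows) if ck_to is None else bisect_right(keys, ck_to)
        return rows[lo:hi]


readings = ToyTable()
readings.insert("sensor-1", 3, "c")
readings.insert("sensor-1", 1, "a")
readings.insert("sensor-1", 2, "b")
readings.insert("sensor-2", 1, "z")

whole_partition = readings.select("sensor-1")
slice_of_partition = readings.select("sensor-1", ck_from=2)
```

Note that the sketch has no method for "find all rows where value == 'b'": answering that would mean scanning every partition, which is the kind of arbitrary column query the slide warns about.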
  11. 11. Cassandra Operations: Read / Write Paths
      Cassandra does these things well.
      1. Write: It first appends the data, immutably, to a commit log, adds it to the memtable so that it is available, and later flushes it to disk as SSTables.
      2. Read: It figures out whether the data is on a node (Orlando Bloomfilter, i.e. a Bloom filter, is involved), reads from the different SSTables, and reconciles the immutable data and deletes into the latest data.
      3. It spreads the load around the ring so that you can have hundreds of nodes doing this without breaking a sweat: beast-like performance.
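The write and read paths above can be sketched as a toy, single-node model in plain Python. This is an illustration under heavy simplification, not how Cassandra is implemented: `ToyNode`, `TinyBloom`, and the integer clock are hypothetical names, and real details such as compaction, commit log recycling, and replication are left out.

```python
import hashlib

TOMBSTONE = object()  # marker written in place of a deleted value


class TinyBloom:
    """Small Bloom filter: can answer "maybe" wrongly, never "no" wrongly."""

    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class ToyNode:
    def __init__(self):
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # key -> (timestamp, value), in memory
        self.sstables = []    # (bloom filter, immutable dict), oldest first
        self.clock = 0

    def write(self, key, value):
        self.clock += 1
        self.commit_log.append((self.clock, key, value))
        self.memtable[key] = (self.clock, value)

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is just another write

    def flush(self):
        bloom = TinyBloom()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        candidates = []
        if key in self.memtable:
            candidates.append(self.memtable[key])
        for bloom, table in self.sstables:
            # The Bloom filter lets the read skip tables that cannot hold the key.
            if bloom.might_contain(key) and key in table:
                candidates.append(table[key])
        if not candidates:
            return None
        _, value = max(candidates, key=lambda c: c[0])  # newest timestamp wins
        return None if value is TOMBSTONE else value


node = ToyNode()
node.write("a", 1)
node.flush()
node.write("a", 2)
node.write("b", 9)
node.flush()
node.delete("b")
```

Here `node.read("a")` has to reconcile two SSTables and returns the newer value, while `node.read("b")` finds a value on disk but a newer tombstone in the memtable, so the row reads as deleted.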
  12. 12. Cassandra Operational Pitfalls Visualized
  13. 13. Wide Partitions
      Wide partitions will completely screw you over on reads and can take a node out if there’s traffic.
      1. Monitor using cfstats (CompactedPartitionMaximumBytes).
      2. Monitor system.log for “Compacting large partition”.
      3. Monitor using toppartitions.
      4. Monitor using OpsCenter (if using DataStax).
  14. 14. Data Skew
      Bad key design can lead to really, really bad data skew. In some cases, if the number of keys is only 1 or 2, the data exists in only one or two partitions (plus their replicas).
      1. Monitor using cfstats (NumberOfKeys, SpaceUsedLive, ReadCounts, WriteCounts).
      2. Monitor using OpsCenter (if using DataStax).
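A rough way to see this skew is to hash partition keys onto a handful of nodes and count rows per node. The sketch below is an approximation under stated assumptions: it uses SHA-256 modulo the node count as a stand-in for the partitioner and token ring, ignores replication, and the key values and the six-node cluster are made up.

```python
import hashlib
from collections import Counter


def node_for(partition_key, node_count):
    """Map a partition key to a node by hashing (a stand-in for the token ring)."""
    digest = hashlib.sha256(str(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count):
    """Count how many rows land on each node, given one partition key per row."""
    counts = Counter(node_for(k, node_count) for k in partition_keys)
    return [counts.get(n, 0) for n in range(node_count)]


NODES = 6
ROWS = 10_000

# Only two distinct key values: every row piles into at most two partitions.
two_keys = rows_per_node(["US" if i % 2 else "CA" for i in range(ROWS)], NODES)

# One key value per row: rows spread over all nodes.
many_keys = rows_per_node([f"row-{i}" for i in range(ROWS)], NODES)

busy_nodes_two_keys = sum(1 for c in two_keys if c > 0)
busy_nodes_many_keys = sum(1 for c in many_keys if c > 0)
```

With two key values, at most two of the six nodes hold any data, no matter how many rows are written; with high-cardinality keys, every node gets a share.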
  15. 15. Tombstones
      How to check for tombstones:
      1. Monitor using cfstats (*Tombstones).
      2. Monitor system.log (“Tombstone Warn Threshold”).
      3. Monitor using OpsCenter (if using DataStax).
  16. 16. Cassandra Data Modeling Best Practices
  17. 17. Good Key Design
      Some things NOT TO DO.
      1. Avoid using Integer/Long keys unless you couple them with another column in a composite partition key (unless you can somehow show, through realistic data generation, that they won’t coalesce data on a few nodes).
      2. Avoid using Time/Date-based keys or TimeUUID unless you know for damn sure that you are going to continuously create data at a given interval all day, every day.
      3. Don’t just import relational data and expect it to magically work.
      Some things TO DO.
      1. A UUID will most likely work fine for any given table, but how do you find the row again? You will need another table that holds that information.
      2. If you must use human-readable keys, you can use a synthetic sharding mechanism (next slide).
      3. You can combine known things, such as (String, Integer, String, Integer), and take a chance, but you should test under load.
      Some things to REMEMBER.
      1. Clustering keys don’t spread data around the cluster.
      2. Remember that (PartitionKey, ClusteringKey) is different from ((PartitionKey1, PartitionKey2)): the first is a partition key plus a clustering key, the second is a composite partition key.
      3. Use realistic data: to properly scale Cassandra or any other system, you need to create realistic data.
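The "UUID key plus a second table to find it again" idea can be sketched with two dictionaries standing in for two tables. Everything named here (`companies_by_id`, `company_id_by_name`, the example company) is hypothetical; the point is only that each query path gets its own table and that the application writes to both.

```python
import uuid

# Two "tables", each modelled as a dict keyed by its partition key.
companies_by_id = {}     # partition key: company_id (UUID) -> company row
company_id_by_name = {}  # partition key: name -> company_id


def create_company(name, city):
    """Write the row under a random UUID and record how to find it by name."""
    company_id = uuid.uuid4()  # random, so keys spread evenly
    companies_by_id[company_id] = {"name": name, "city": city}
    company_id_by_name[name] = company_id  # second write keeps the lookup path
    return company_id


def find_company_by_name(name):
    """Two key lookups: name -> id, then id -> row."""
    company_id = company_id_by_name.get(name)
    if company_id is None:
        return None
    return companies_by_id[company_id]


new_id = create_company("ExampleCo", "Washington")
found = find_company_by_name("ExampleCo")
```

The trade-off is an extra write per insert and two reads per lookup by name, in exchange for a well-distributed primary table.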
  18. 18. Spreading Data via Synthetic Sharding
      Sometimes you need to use the key that you have, which is human readable, because that is the query path. How do you deal with that?
      1. Primary Key: ((CountryName, StateName, CityName, CompanyName))
      2. Integer shard added: ((CountryName, StateName, CityName, CompanyName, ShardNumber))
      3. ShardNumber could be 1-10 or 1-100, depending on how badly your data is spreading.
      Let’s say you are using a time-based key and notice coalescing around a particular time of day. You could consider making the weekday itself part of the key.
      1. Primary Key: (CreatedDate)
      2. Weekday number added: ((CreatedDate, WeekDay))
      3. WeekDay would be 0-6, mapped to Sunday-Saturday.
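A minimal sketch of the sharding idea, assuming the shard number is picked at random on write and that readers fan out over every shard. The dict-based `store`, the helper names, `SHARD_COUNT = 10`, and the example company are all assumptions made for illustration; the slide does not say how the shard number is chosen.

```python
import random
from collections import defaultdict
from datetime import date

SHARD_COUNT = 10  # assumed; the slide suggests 1-10 or 1-100


def sharded_key(country, state, city, company, rng):
    """Append a random shard number to the human-readable partition key."""
    return (country, state, city, company, rng.randint(1, SHARD_COUNT))


def all_shard_keys(country, state, city, company):
    """Every partition key a reader must query to see all of the data."""
    return [(country, state, city, company, s) for s in range(1, SHARD_COUNT + 1)]


def weekday_number(d):
    """0-6 mapped to Sunday-Saturday (Python's own weekday() has Monday=0)."""
    return (d.weekday() + 1) % 7


rng = random.Random(42)    # seeded only to make the example repeatable
store = defaultdict(list)  # partition key -> rows
company = ("USA", "DC", "Washington", "ExampleCo")

# Writes: 1000 events for one company end up in several partitions.
for event_id in range(1000):
    store[sharded_key(*company, rng)].append(event_id)

# Reads: query each shard's partition and merge the results.
events = [e for key in all_shard_keys(*company) for e in store.get(key, [])]
```

The cost of the extra spread is on the read side: one logical query becomes `SHARD_COUNT` partition reads, so the shard count is worth keeping as small as the data allows.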
  19. 19. Avoiding Tombstones: Just say no to Tombstones!
      The reason tombstones exist is to make it possible to do insanely fast writes and updates and still be able to send the data back performantly. (Side conversation on queues as an anti-pattern.)
      1. There is no need to set null values or delete data actively.
      2. You can always do soft deletes or use TTL values that expire data automatically.
      3. Watch out for prepared statements sending nulls.
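The warning about prepared statements sending nulls can be illustrated with a toy model in which binding `None` to a column records a tombstone, while an "unset" marker leaves the column untouched. This is a simulation, not driver code: `UNSET`, `TOMBSTONE`, `apply_write`, and `bind` are hypothetical stand-ins, and the column names are invented.

```python
UNSET = object()      # stand-in for a driver's "leave this column alone" marker
TOMBSTONE = object()  # what the toy storage engine records for a null


def apply_write(row, bound_values):
    """Apply bound column values to a stored row in the toy engine."""
    for column, value in bound_values.items():
        if value is UNSET:
            continue  # nothing is written, so no tombstone is created
        row[column] = TOMBSTONE if value is None else value
    return row


def bind(columns, data):
    """Bind only what the caller supplied; missing columns become UNSET."""
    return {c: data.get(c, UNSET) for c in columns}


def count_tombstones(row):
    return sum(1 for v in row.values() if v is TOMBSTONE)


COLUMNS = ["name", "email", "phone"]
incoming = {"name": "Ada"}  # email and phone were never provided

# Naive binding: every missing column is sent as None.
naive_row = apply_write({}, {c: incoming.get(c) for c in COLUMNS})

# Careful binding: missing columns are left unset.
careful_row = apply_write({}, bind(COLUMNS, incoming))
```

In this model the naive insert writes two tombstones for a brand-new row that never had an email or phone, and the careful insert writes none. An explicit `None` passed in `incoming` would still produce a tombstone, which is the intended behavior for a real delete.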
  20. 20. Questions?
  21. 21. We’re Partnering / Hiring
      1. Professional Services: DataStax, Sitecore, Spark, Docker, Solr, Cassandra, Kafka, Elastic, AWS, Azure
      2. Digital Services: React/Angular, TypeScript, ASP.NET, Node, Python
  22. 22. www.anant.us | solutions@anant.us | (855) 262-6826 | 3 Washington Circle, NW, Suite 301, Washington, DC 20037
      Data & Analytics: Cassandra, DataStax, Kafka, Spark
      Customer Experience: Sitecore
      Information Systems: Salesforce, Quickbooks, and more
