Intuit: Reporting from the Trenches: Using Cassandra Effectively

At Intuit Data Engineering and Analytics, we work on multiple products and offerings from Profile Store to deeply personalized A/B testing platforms.

In this session, I will cover how Intuit uses Cassandra on its personalized A/B testing platform, the concerns we faced, and what we learned. We hope that sharing both the issues and the lessons helps other Cassandra users mitigate or avoid the same problems. We built a scalable, highly available, responsive personalized A/B testing platform on AWS, with Cassandra as our NoSQL backend.

Concerns/Interesting patterns:
Constant long garbage collection
Repair takes a long time
Restacking pains
Potential data loss after decommissioning nodes
nodetool decommission logs errors silently
OpsCenter had a performance impact on production
Strange status: /etc/init.d/dse status showed running, but cqlsh would not start
TokenRangeOffline exception
sstableloader fails to stream SSTables when internode encryption is enabled

Learnings:
Verify that the replication factor is set correctly.
Check the read and write quorums; if different modules set them differently, confirm the behavior is what you expect.
Cassandra's data model is denormalized, so create tables judiciously.
Index the datetime field if selects will need WHERE clauses on datetime.
Always monitor heap, garbage collection, and threads for Cassandra.
Always take a current snapshot before attempting a restack.
Have a data recovery strategy: regular snapshots moved to S3.
Configure cassandra.yaml and cassandra-env.sh correctly for GC and heap dumps.
Understand the capabilities of nodetool cfstats, tpstats, compactionstats, and netstats.
Understand compaction and tombstones.
DSE 4.7 has good data migration capabilities and faster repair times.
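
The advice above about replication factor and read/write quorums can be made concrete with a small check. The following is only a sketch and is not taken from the talk: the function names are hypothetical, it considers a single datacenter, and it covers only the ONE, TWO, THREE, QUORUM, and ALL consistency levels. It relies on the standard rule that a read is guaranteed to overlap the latest acknowledged write when read replicas plus write replicas exceed the replication factor.

```python
# Sketch only: helper names are hypothetical, single-datacenter case assumed.

def quorum(replication_factor: int) -> int:
    """Number of replicas that form a quorum: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must respond for a given consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        needed = quorum(replication_factor)
    elif level == "ALL":
        needed = replication_factor
    else:
        raise ValueError(f"unsupported consistency level: {level}")
    if needed > replication_factor:
        raise ValueError(f"{level} needs {needed} replicas, RF is {replication_factor}")
    return needed


def reads_see_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when R + W > RF, so every read overlaps the latest write."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    # Compare the settings two modules might use against the same keyspace.
    for read, write in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ONE", "ALL")]:
        ok = reads_see_latest_write(read, write, replication_factor=3)
        print(f"read={read} write={write} RF=3 overlap={ok}")
```

Running a check like this for each module's read and write settings is one way to spot a module whose consistency levels do not give the behavior the rest of the platform assumes.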

With support from DataStax, we had a great tax season and served our users. A few puzzling pieces remain that we are still working through with DataStax. We hope this sharing helps other Cassandra users use it most effectively!

Published in: Technology

  1. Rekha Joshi, Intuit, Inc. Reporting From Trenches: Using Cassandra Effectively!
  2. Who Am I? Staff Engineer at Intuit Inc. O'Reilly Certified Apache Cassandra Professional
  3. Good Software?
  4. And a Truly Successful Software?
  5. All This Data!!!!!!!!!!
  6. Can I Lift This Alone?
  7. Need For Speed
  8. Cassandra, who? Cassandra is a Java-based NoSQL, linearly scalable, best in class tunable performance, fault tolerant, distributed, masterless, time series database.
  9. Cassandra: The Hybrid Kid has the Edge! DynamoDB (Amazon): Cassandra inherits data distribution, influencing masterless architecture, linear scalability, tunable consistency/performance. Big Table (Google): Cassandra inherits data model, influencing application query access patterns.
  10. Intuit And Cassandra. Cassandra = Intuit Technology Standard of Choice for NoSQL Distributed Database
  11. Intuit On Mission
  12. Personalized AB Testing Platform
  13. Cassandra And DataStax Enterprise: Advanced Security, Analytics Options, Advanced Tools
  14. Your Worries?
  15. Fantasy And Engineering Fantasy
  16. Trusting -> Paranoid -> Seasoned. Timeline: Oct Start, Oct End, Nov, Dec, Jan, Feb, Mar, Apr. Milestones: Blank Slate; Application live on internal network; Application live on AWS; Security approved (Data Security, Encryption); System happy, load tested, multiple releases, customers happy; Learnings – How? Why?; Successful Mini Peak Traffic, Paranoid Monitoring; Application releases use cases, Refactoring Data Model; Excellent Peak Tax season!!!
  17. Garbage Collection Issue
  18. Clock Issue
  19. Understand the Node Ring: nodetool status, nodetool ring, nodetool info, nodetool cfstats, nodetool tpstats. Repeat after me: Cassandra is a Java-based NoSQL, linearly scalable, best in class tunable performance, fault tolerant, distributed, masterless, time series database.
  20. What If A Node Goes Down?
  21. Tuning The Application: Refactor data model; Revisit the usage access patterns; Paranoid Monitoring. Repeat after me: Cassandra is a Java-based NoSQL, linearly scalable, best in class tunable performance, fault tolerant, distributed, masterless, time series database.
  22. Tuning For Reads
  23. Tuning For Writes
  24. Tuning The System
  25. Little Talked Aspect Of The Pareto Principle!
  26. Heavy Lifting? Easy!
  27. Thank You! https://www.linkedin.com/in/rekhajoshm https://twitter.com/rekhajoshm
