Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons

We'll cover some aspects of our architecture and highlight the differences between MongoDB and Cassandra. We'll explain in depth why Cassandra is a better choice for our general-purpose application platform (SHIFT) as well as our media buying analytics tool (the SHIFT Media Manager). We'll go over common design patterns that people coming from a MongoDB background may be familiar with and show how Cassandra can be used as a better alternative. We'll also touch on cqlengine, the Cassandra object mapper for Python, which is nearing feature completeness.

Published in: Technology

  1. 1. MONGODB TO CASSANDRA ARCHITECTURAL LESSONS
     Jon Haddad & Blake Eggleston
  2. 2. Overview
     • Differences in DB Architectures
     • SHIFT Platform
     • SHIFT Media Manager
     • Intro to cqlengine
  3. 3. MongoDB Architecture: Important Concepts
     • replica set (master / slave)
     • shard (replica set within a cluster)
     • config server (topology)
     • mongos (router)
     • Shard key is an indexed field that determines the shard a particular document belongs to
     sources: http://docs.mongodb.org/manual/core/sharded-cluster-architectures-production/, http://docs.mongodb.org/manual/core/sharding-shard-key/
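
The role of the shard key in routing can be illustrated with a minimal sketch. It assumes range-based chunks on a hypothetical `user_id` shard key with made-up boundaries and shard names; it is not MongoDB's implementation, only the idea that a router can target one shard when the query includes the shard key and must otherwise ask every shard.

```python
import bisect

# Hypothetical chunk layout on the shard key `user_id` (illustrative values):
# shard0 owns keys below 1000, shard1 owns 1000..1999, shard2 owns 2000 and up.
CHUNK_UPPER_BOUNDS = [1000, 2000]
SHARDS = ["shard0", "shard1", "shard2"]
SHARD_KEY = "user_id"


def route(query):
    """Return the shards a router would contact for an equality query."""
    if SHARD_KEY in query:
        # The shard key is present, so exactly one shard can hold the document.
        idx = bisect.bisect_right(CHUNK_UPPER_BOUNDS, query[SHARD_KEY])
        return [SHARDS[idx]]
    # Without the shard key the router cannot narrow the target,
    # so the query is sent to every shard (scatter-gather).
    return list(SHARDS)
```

For example, `route({"user_id": 1500})` targets a single shard, while `route({"email": "a@example.com"})` fans out to all three.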
  4. 4. Cassandra Architecture
     • Only 1 type of server (Cassandra)
     • Ring Based Replication (no master or slave)
     • No single point of failure
     • Key hashes to a location in the ring
     • Replication Factor (RF=3)
     • Limited query flexibility (always select by key)
     • Each query has a consistency level
     source: http://developer.rackspace.com/images/2013-03-27-rackspace-service-registry-status-update/vnodes.png
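
How a key hashes to a location in the ring, and how a replication factor of 3 picks replicas, can be sketched as follows. This is a simplified illustration under stated assumptions: one token per node, SHA-256 standing in for the partitioner's hash, and replicas chosen by walking clockwise. The `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: one token per node, replicas are the next RF nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Each node is placed on the ring at the hash of its name.
        self.tokens = sorted((self._hash(node), node) for node in nodes)

    @staticmethod
    def _hash(value):
        # Assumption: SHA-256 is used here only as a convenient stand-in hash.
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key):
        # Find the first node whose token is >= the key's token, wrapping around.
        start = bisect.bisect_left(self.tokens, (self._hash(key), ""))
        count = min(self.rf, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]
```

Because every node can compute the same placement, any node can act as coordinator for any key, which is why there is no master to fail.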
  5. 5. Cassandra Storage
     • SSTables are immutable
     • Each column includes a timestamp of when it was written
     • The same column can exist for a given key in multiple SSTables
     • Deletes are written as tombstones
     • SSTables are periodically merged (compaction)
     • Compaction keeps the column with the latest timestamp on conflicts
     source: http://developer.rackspace.com/images/2013-03-27-rackspace-service-registry-status-update/vnodes.png
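
The merge rule described on this slide (latest timestamp wins, deletes are tombstones) can be sketched in a few lines. This is a toy model, assuming each SSTable is a dict mapping `(row_key, column)` to `(timestamp, value)`; the names `compact`, `read`, and `TOMBSTONE` are illustrative. The sketch keeps tombstones in the merged output, whereas a real compaction eventually purges them.

```python
TOMBSTONE = object()  # marker value written in place of a deleted column


def compact(sstables):
    """Merge SSTables, keeping the cell with the latest timestamp on conflicts."""
    merged = {}
    for table in sstables:
        for cell, (timestamp, value) in table.items():
            if cell not in merged or timestamp > merged[cell][0]:
                merged[cell] = (timestamp, value)
    return merged


def read(merged, row_key, column):
    """Return the live value for a cell, or None if it is missing or deleted."""
    entry = merged.get((row_key, column))
    if entry is None or entry[1] is TOMBSTONE:
        return None
    return entry[1]
```

Since the input tables are never modified, the result is the same regardless of the order in which SSTables are merged.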
  6. 6. Cassandra Writes
     • A write can be sent to any node in the cluster; that node acts as the coordinator and works out which nodes should store it.
     • Writes are saved in memory to a “memtable” and appended to a commit log.
     • Memtables are flushed to disk periodically as SSTables.
     source: http://www.datastax.com/docs/_images/write_access.png
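
The write path on a single node can be sketched as below. This is a simplified model, not Cassandra's code: the `Node` class, the size-based `flush_threshold`, and clearing the whole commit log on flush are assumptions made to keep the example short.

```python
class Node:
    """Toy single-node write path: commit log, memtable, immutable SSTables."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []  # append-only record used for crash recovery
        self.memtable = {}    # in-memory cells: (key, column) -> (timestamp, value)
        self.sstables = []    # immutable, sorted snapshots of flushed memtables
        self.flush_threshold = flush_threshold

    def write(self, key, column, value, timestamp):
        # Record the write durably first, then apply it in memory.
        self.commit_log.append((key, column, value, timestamp))
        current = self.memtable.get((key, column))
        if current is None or timestamp >= current[0]:
            self.memtable[(key, column)] = (timestamp, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable SSTable and start fresh.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed data no longer needs the log
```

Note that a write never reads existing data from disk, which is one reason writes are cheap compared to reads.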
  7. 7. Cassandra Reads
     • Any server may be queried
     • Acts as coordinator
     • Data is pulled from SSTables and merged
     • Contacts nodes with the requested key
     • Performs read repair if necessary
     • Reads are a more time consuming operation than writes.
     source: http://www.datastax.com/docs/_images/write_access.png
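
Read repair can be illustrated with a small sketch. It assumes each replica is a dict mapping a key to `(timestamp, value)` and that the coordinator contacts every replica; in practice the number of replicas consulted depends on the query's consistency level. The function name `coordinator_read` is hypothetical.

```python
def coordinator_read(replicas, key):
    """Return the newest value for key and repair replicas that are stale or missing it."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    found = [cell for _, cell in responses if cell is not None]
    if not found:
        return None
    # The cell with the latest timestamp wins.
    newest = max(found, key=lambda cell: cell[0])
    # Read repair: push the winning cell to any replica that returned older data.
    for replica, cell in responses:
        if cell is None or cell[0] < newest[0]:
            replica[key] = newest
    return newest[1]
```

After one such read, all contacted replicas agree on the value, so later reads of the same key need no repair.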
  8. 8. MongoDB Advantages
     • Very Flexible Documents
     • Very Flexible Queries
     • Full text search (2.4)
     • Aggregation Framework
     • Geospatial Indexes / Queries
     • Really good documentation
  9. 9. MongoDB Pitfalls
     • Many queries will route to the entire cluster
     • Overwriting documents / changing doc sizes causes memory fragmentation problems (db repair)
     • Query language is awkward for humans
     • Queries that go to disk pay an enormous penalty
     • Max size of 256GB per collection
     source: https://blog.serverdensity.com/map-reduce-and-mongodb/
  10. 10. Cassandra Advantages
     • Multi data center aware & reliable
     • Fewer moving parts
     • No DB / table locking
     • Unbelievable with time series data (stats)
     • Performance scales linearly as you add servers
     • Optimized compaction options for traditional spinning disks and SSDs
     • Lots of control over how your data is stored on disk.
  11. 11. Cassandra Pitfalls
     • Secondary indexes have hidden costs
     • Individual reads (single rows) are not as fast as other DBs
     • JVM can be intimidating (GC)
     • Data modeling requires more planning
     • Generally need to construct a table per query you intend to run
     • Ad hoc queries or queries with lots of permutations can be very difficult to model
     • We complement Cassandra with Elasticsearch for these types of queries (Solr and DataStax Enterprise are also good choices)
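
The "table per query" pattern can be sketched with plain dictionaries standing in for tables. The table names, the `save_ad` helper, and the ad fields below are hypothetical; the point is only that the same record is written once per access path, so each query becomes a lookup by key.

```python
from collections import defaultdict

# Two hypothetical "tables", each keyed by the value one query selects on.
ads_by_campaign = defaultdict(dict)  # campaign_id -> {ad_id: ad}
ads_by_account = defaultdict(dict)   # account_id -> {ad_id: ad}


def save_ad(ad):
    """Write the ad into every table that serves a query (denormalized writes)."""
    ads_by_campaign[ad["campaign_id"]][ad["ad_id"]] = ad
    ads_by_account[ad["account_id"]][ad["ad_id"]] = ad
```

The trade-off is more writes and more planning up front in exchange for reads that never need a scan or a join.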
  12. 12. Media Manager Social Analytics
  13. 13. What is Media Manager?
     • Ad buying and management tool for Facebook, Twitter
     • We sync ~2 billion ad stats a month
     • We roll up stats at multiple levels in real time
     • 10 node C* cluster, AWS high I/O
     • Peaked at 150K queries / second
     • Approx 150GB of data, growing 10% / week
  14. 14. Real time Rollups
     • A single row per parent object type & date:

         campaign+date | ad1   | ad2   | ad3
                       | stats | stats | stats

     • For any object (team, folder, campaign) we can perform a rollup for a given date by accessing only a single row. This limits our I/O and is extremely efficient.
     • New ad stats are propagated up immediately in rollups with very few reads:

         rollup
         folder+date | campaign1 | campaign2 | campaign3
                     | stats     | stats     | stats
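
The rollup layout can be sketched with an in-memory model. It assumes one row per (parent type, parent id, date) with one column per child; the function names and the metric names (`clicks`, `impressions`) are illustrative, not taken from the Media Manager code.

```python
from collections import defaultdict

# rows[(parent_type, parent_id, date)][child_id] = {"metric": value, ...}
rows = defaultdict(dict)


def record_stats(parent_type, parent_id, date, child_id, stats):
    """Store a child's stats as one column in its parent's row for that date."""
    rows[(parent_type, parent_id, date)][child_id] = stats


def rollup(parent_type, parent_id, date):
    """Sum all child stats by reading exactly one row."""
    totals = defaultdict(int)
    for stats in rows.get((parent_type, parent_id, date), {}).values():
        for metric, value in stats.items():
            totals[metric] += value
    return dict(totals)


def update_ad(folder_id, campaign_id, date, ad_id, stats):
    """Write new ad stats and propagate the campaign total up to the folder row."""
    record_stats("campaign", campaign_id, date, ad_id, stats)
    # One row read yields the campaign total, which becomes a column in the folder row.
    record_stats("folder", folder_id, date, campaign_id,
                 rollup("campaign", campaign_id, date))
```

Each level of the hierarchy costs one row read and one column write per update, which is what keeps propagation cheap.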
  15. 15. Why Cassandra?
     • Almost our entire DB is in our working set.
     • We have rows on disk that are inconsistently sized, so heuristics on doc size for preallocation are not useful.
     • We could not tolerate unpredictable query behavior due to disk access.
  16. 16. SHIFT.com Collaboration Platform
  17. 17. Real time Collaboration
     • Built for marketers
     • Allows communication across departments and organizations
     • 3rd party applications
  18. 18. Messaging
     • Messages are fanned out to an entire team
     • Teams may have hundreds of members
     • Each member has their own view of their messages and their own metadata on those messages (tags & unread)
  19. 19. Message Inbox
     • When a message is sent or replied to, we insert a record with a timeuuid into a person's stream which points to the message.
     • Timeuuids are stored on disk in reverse order of the embedded timestamp:

         user  | timeuuid1 | timeuuid2 | timeuuid3
         jon   | msg1      | msg2      | msg3
         blake | msg3      | msg1      | msg2

     • We can easily query the row for the first N items in the user's inbox
     • We store multiple views as tags for each user to quickly surface messages in different contexts.
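
The inbox pattern can be sketched with Python's standard `uuid` module, whose version 1 UUIDs embed a timestamp just as timeuuids do. This is an in-memory model under stated assumptions: the dictionaries stand in for tables, sorting at read time stands in for the reverse on-disk ordering, and the `send` and `inbox` helpers are hypothetical.

```python
import uuid
from collections import defaultdict

messages = {}                # message_id -> body (stored once)
inboxes = defaultdict(dict)  # user -> {timeuuid: message_id} (one row per user)


def send(message_id, body, team_members):
    """Store the message once and fan out a timeuuid pointer to each member."""
    messages[message_id] = body
    for user in team_members:
        inboxes[user][uuid.uuid1()] = message_id


def inbox(user, limit):
    """Return the newest `limit` message ids from a single user's row."""
    newest_first = sorted(inboxes[user], key=lambda u: u.time, reverse=True)
    return [inboxes[user][u] for u in newest_first[:limit]]
```

Because each user has their own row of pointers, per-user metadata such as tags or unread state can be attached without touching the shared message.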
  20. 20. CQLENGINE python CQL3 mapper
  21. 21. cqlengine features
     • CQL3 Object Mapper for Python
     • Supports Cassandra 1.2
     • Builds queries supporting the following:
        • TTLs
        • Per Query Consistency
        • Blind Table Updates
        • Batch Queries
        • Counters
        • Maps, sets, lists
        • Schema management
        • Per table compaction settings
        • Table Polymorphism
  22. 22. Table Polymorphism
     • In a single table we can have heterogeneous objects
     • We use this on Media Manager for Ad types:

         campaign | ad | type
         1        | 1  | page_post
         1        | 2  | mobile_ad
         1        | 3  | application_ad
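
The idea behind table polymorphism is a discriminator column that selects which class a row is loaded as. The sketch below shows that idea in plain Python; it is not cqlengine's API, and the class names, the `registry`, and `from_row` are hypothetical.

```python
class Ad:
    """Base class; subclasses register themselves under their `type` column value."""

    registry = {}
    type_name = None

    def __init__(self, campaign, ad):
        self.campaign = campaign
        self.ad = ad

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Map the discriminator value to the concrete class.
        Ad.registry[cls.type_name] = cls

    @classmethod
    def from_row(cls, row):
        # Pick the concrete class from the row's `type` column.
        return cls.registry[row["type"]](row["campaign"], row["ad"])


class PagePost(Ad):
    type_name = "page_post"


class MobileAd(Ad):
    type_name = "mobile_ad"


class ApplicationAd(Ad):
    type_name = "application_ad"
```

Loading the three rows from the slide with `Ad.from_row` would yield one instance each of `PagePost`, `MobileAd`, and `ApplicationAd`, all read from the same table.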
  23. 23. Upcoming Features
     • Work seamlessly with multiple clusters
     • Native driver integration
     • Key cache / row cache configuration
     • Cassandra 2.0 features
     • Third party plugins
     • session
     • flask
     • identity map
  24. 24. THANK YOU
     Jon: jon@shift.com, @rustyrazorblade
     Blake: blake@shift.com, @beggleston
     www.shift.com
     SANTA MONICA 310.310.8315
     PALO ALTO 650.804.8319
     NEW YORK 646.649.2972
     CHICAGO 312.465.2152
