CASSANDRA
IN E-COMMERCE

Alexander Solovyev
solovyov.a.g@gmail.com
A PILOT PROJECT: ONLINE PRODUCT CATALOG
FOR AN E-COMMERCE PLATFORM, MIGRATION TO
CASSANDRA
The previous version was based on the Oracle Coherence in-memory
data grid. All data from the primary storage (a relational database)
was cached in the data grid.
Goals of the migration:
•  minimize the time required for a system restart
•  keep at least two copies of the data in different data centers
•  quick and simple backup
ARCHITECTURE IN A NUTSHELL
•  Application server: all business logic + web services
    •  stateless
    •  with local caches
•  Data storage
    •  Oracle Coherence, then Cassandra via DataStax Java Driver
•  Batch data loading based on Spring Batch
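The layering above (a stateless application server with a local cache in front of the data storage) can be sketched as a read-through cache. This is a minimal illustration, not the project's code: `NearCache`, `load_from_storage`, the size limit and the TTL are hypothetical, and a plain dict stands in for the storage.

```python
import time
from collections import OrderedDict


class NearCache:
    """Tiny read-through LRU cache with a per-entry time-to-live."""

    def __init__(self, loader, max_entries=1000, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader            # called on a miss; stands in for a storage query
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()    # key -> (expires_at, value), least recently used first

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            self._entries.move_to_end(key)     # mark as most recently used
            return hit[1]
        value = self._loader(key)              # miss or expired entry: go to the storage
        self._entries[key] = (now + self._ttl, value)
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value


# Hypothetical backing store and loader; a dict stands in for Cassandra.
storage = {"product:1": {"name": "kettle"}, "product:2": {"name": "toaster"}}
storage_reads = []


def load_from_storage(key):
    storage_reads.append(key)
    return storage[key]


cache = NearCache(load_from_storage, max_entries=2, ttl_seconds=60.0)
cache.get("product:1")   # miss: reads the storage
cache.get("product:1")   # hit: served from the local cache
```

Because the cache lives inside each application server process, the server stays stateless from the cluster's point of view: losing a node loses only cached copies, never data.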
WHAT SHOULD A PRODUCT LOOK LIKE TO MEET
THE REQUIREMENTS?
Some hypotheses:
•  Data is on disk – available immediately after restart
•  OS disk cache brings all the data to memory
•  Key-value storage to simplify migration of the codebase
Nice to have:
•  Simple deployment configuration
•  Java-based solution
BASIC REQUIREMENTS / USE CASES
•  reads: ~5K TPS
    •  transactions can include more than one round trip to the
       storage, as well as more than one key in a query (“multi-gets”)
    •  ~50K TPS on the storage side
•  full data reload (once per 24 hours)
•  partial update of values (e.g. of product attributes)
•  availability 24x7
•  millions of products
•  tens of millions of related entities (product attributes etc.)
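If the two throughput figures above describe the same read load, they imply an average fan-out per transaction. A quick calculation using only the numbers from the slide (the variable names are illustrative):

```python
app_tps = 5_000        # read transactions per second at the application level
storage_tps = 50_000   # operations per second on the storage side

# Average number of storage operations behind one application transaction.
ops_per_transaction = storage_tps / app_tps
print(ops_per_transaction)
```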
CANDIDATES
•  MongoDB
•  HBase
•  Oracle Coherence + data persistence (a la Riak)
•  Cassandra
PERFORMANCE TESTING ENVIRONMENT
•  Production-ready implementation
•  4 boxes (16 CPU, 24 GB) x 1 Cassandra instance
•  2 boxes x 2 app servers
•  100 GB of test data - fits in memory
•  Main test is read queries:
    •  one hour
    •  up to 500 users
    •  even distribution of requested keys
WHAT HELPED
•  configure your Cassandra cluster
    •  “OS swap off”
    •  different physical disks for different file sets, e.g. data vs. commit log
    •  choose the right (“private”) network interface
•  async queries for multi-gets + token-aware routing on the app server side:
   15% better TPS and latency
•  use the latest Cassandra version
    •  a good example: 1.2.6 => 1.2.8 gave +15% TPS, latency 2x better
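The combination of async multi-gets and token-aware routing can be sketched as follows. This is a toy model under stated assumptions, not the project's code: `ToyRing`, `FakeNode` and `multi_get` are hypothetical names, MD5 stands in for the cluster's real partitioner, and `asyncio.sleep` simulates a network round trip. The DataStax Java Driver offers `Session.executeAsync` and `TokenAwarePolicy` for the same purposes; the slides do not say which mechanism the project used.

```python
import asyncio
import bisect
import hashlib


class FakeNode:
    """Stand-in for one storage node holding the rows it owns."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.served = []

    async def query(self, key):
        await asyncio.sleep(0.01)   # simulated network round trip
        self.served.append(key)
        return self.rows.get(key)


class ToyRing:
    """Maps a key to the node that owns its token, as a token-aware client would."""

    def __init__(self, nodes, ring_size=2**32):
        self._nodes = list(nodes)
        self._size = ring_size
        step = ring_size // len(self._nodes)
        # Node i owns every token up to and including its own token.
        self._tokens = [step * (i + 1) - 1 for i in range(len(self._nodes))]

    def token(self, key):
        digest = hashlib.md5(key.encode("utf-8")).digest()  # stand-in partitioner
        return int.from_bytes(digest[:4], "big") % self._size

    def owner(self, key):
        index = bisect.bisect_left(self._tokens, self.token(key))
        return self._nodes[index % len(self._nodes)]  # wrap around the ring


async def multi_get(ring, keys):
    """Send one async single-key query per key, each directly to the key's owner."""
    results = await asyncio.gather(*(ring.owner(k).query(k) for k in keys))
    return dict(zip(keys, results))


nodes = [FakeNode(f"node{i}") for i in range(4)]
ring = ToyRing(nodes)
data = {f"product:{i}": {"id": i} for i in range(20)}
for key, value in data.items():
    ring.owner(key).rows[key] = value   # place each row on its owner

result = asyncio.run(multi_get(ring, list(data)))
```

All queries are in flight at the same time, so the multi-get takes roughly one round trip instead of one per key, and no query needs an extra hop through a coordinator that does not own the data.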
WHAT HELPED
•  use the key of a parent entity as the first component of the children's keys:
   PRIMARY KEY (parent-ID, child-ID)
    •  minimizes the number of queries / disk seeks
    •  +15% TPS, latency 2x better
•  use local (“near”) caches on the app server side: +15% TPS
    •  local EHCache
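Why a (parent-ID, child-ID) key reduces the number of queries can be shown with a small in-memory model. This is an illustrative sketch only: `FlatTable` and `PartitionedTable` are hypothetical classes, and dicts stand in for Cassandra partitions.

```python
from collections import defaultdict


class FlatTable:
    """Children keyed only by child-ID: every child is a separate lookup."""

    def __init__(self):
        self.rows = {}
        self.lookups = 0

    def put(self, child_id, value):
        self.rows[child_id] = value

    def get(self, child_id):
        self.lookups += 1
        return self.rows[child_id]


class PartitionedTable:
    """Children keyed by (parent-ID, child-ID): one partition holds all children of a parent."""

    def __init__(self):
        self.partitions = defaultdict(dict)
        self.lookups = 0

    def put(self, parent_id, child_id, value):
        self.partitions[parent_id][child_id] = value

    def get_children(self, parent_id):
        self.lookups += 1
        children = self.partitions.get(parent_id, {})
        return [children[c] for c in sorted(children)]  # ordered by child-ID


# Three attributes of a hypothetical product 42.
attributes = {"color": "red", "size": "M", "weight": "1kg"}

flat = FlatTable()
partitioned = PartitionedTable()
for attr_id, value in attributes.items():
    flat.put(attr_id, value)
    partitioned.put(42, attr_id, value)

flat_result = [flat.get(attr_id) for attr_id in sorted(attributes)]
partitioned_result = partitioned.get_children(42)
```

Both layouts return the same rows, but the flat layout pays one lookup per child while the composite key needs a single partition read.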
WHAT DID NOT HELP
•  Java GC monitoring on Cassandra boxes
    •  with the recommended settings, GC takes at most 7% of the overall
       test time
•  caching == ALL
    •  all the data is already in the OS disk cache
INTERESTING EXPERIMENTS
•  another implementation of the token-aware query routing
•  JSON or any other serialized data format for values, if partial updates
   are not needed: a pure key-value model
    •  avoids creating tombstones on updates in cases where the values would
       otherwise contain Cassandra collections
    •  another option is to tune tombstone GC
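The pure key-value model from the second experiment can be sketched like this. It is a minimal illustration under stated assumptions: `store`, `save_product`, `load_product` and `update_attribute` are hypothetical names, and a dict stands in for a table with one text column per key.

```python
import json

store = {}  # hypothetical key-value table: product ID -> JSON text


def save_product(product_id, product):
    """Serialize the whole entity and overwrite the stored value in one write."""
    store[product_id] = json.dumps(product, sort_keys=True)


def load_product(product_id):
    return json.loads(store[product_id])


def update_attribute(product_id, name, value):
    """No partial update: read the blob, change it in memory, write it back whole."""
    product = load_product(product_id)
    product["attributes"][name] = value
    save_product(product_id, product)


save_product(42, {"name": "kettle", "attributes": {"color": "red"}})
update_attribute(42, "color", "blue")
```

The price of this model is the read-modify-write cycle in `update_attribute`, which is why the slide limits it to cases where partial updates are not needed.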
SUMMARY
•  Cassandra is a stable and sufficiently mature product
•  It can compete with in-memory caches and data grids, at least if the
   dataset is small enough to fit in memory
•  It is actively developed, has a large community, and has good commercial
   support from DataStax
THANK YOU

…and your questions :)
