Image Credits
haystack http://www.flickr.com/photos/james_lumb/3921968993
pyramids http://www.flickr.com/photos/gracewong/93631410
scales http://www.flickr.com/photos/eflon/3465042138
friends http://www.flickr.com/photos/ngmmemuda/4166182931
television http://www.flickr.com/photos/angelrravelor/314306023
columns http://www.flickr.com/photos/nostri-imago/3564300653
devil http://www.flickr.com/photos/52890443@N02/4887855756
angel http://www.flickr.com/photos/75001512@N00/4938623021
transaction http://www.flickr.com/photos/neubie/2273635564
queries http://www.flickr.com/photos/-bast-/349497988
rings http://www.flickr.com/photos/baldur/4395738741
indexes http://www.flickr.com/photos/waferboard/4137041591
panic http://www.flickr.com/photos/pasukaru76/3998981988
procedures "The Anatomy Lesson of Dr. Nicolaes Tulp" by Rembrandt
relational http://www.flickr.com/photos/35536700@N07/3292544674
desert http://www.flickr.com/photos/waldenpond/4252575735
jet http://www.flickr.com/photos/rmahle/709685
queries http://www.flickr.com/photos/andreanna/2812118063
blackboard http://www.flickr.com/photos/shonk/418180402
normal http://www.flickr.com/photos/infrogmation/3180606117
phonograph http://www.flickr.com/photos/shiyazuni/4770244591
dodo http://www.flickr.com/photos/wheatfields/2071347416
Internals http://www.flickr.com/photos/37hz/4057856826
writing http://www.flickr.com/photos/stevendepolo/3877225152
consistency http://www.flickr.com/photos/betsyweber/4962297050
partitioning http://www.flickr.com/photos/featheredtar/3137028766
slices http://www.flickr.com/photos/free-stock/4899674517
summary http://www.flickr.com/photos/jkdsphotography/4061838798
links http://www.flickr.com/photos/creative_stock/3397559016
Editor's Notes
It’s not about one data model vs. another.
It’s not about one storage engine vs. another.
Cassandra excels at replicating data and achieving high sustained write throughput.
The right tool for the right job
Shaped by distribution model
Sparse: columns do not have to exist in every row.
Flexible column naming.
You define the sort order.
A row is not required to have a specific column just because another row does.
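A minimal sketch of what "sparse" and "you define the sort order" mean in practice. This is a toy in-memory model built from plain Python dicts, not Cassandra's API; the `users` data and the `columns_in_order` helper are hypothetical names chosen for illustration, and sorting by column name stands in for whatever comparator the column family is configured with.

```python
# Toy model of a column family: row key -> {column name: value}.
# Illustration only; this is not how a Cassandra client is used.
users = {
    "alice": {"email": "alice@example.com", "city": "Austin"},
    "bob": {"email": "bob@example.com", "twitter": "@bob", "zip": "78701"},
}


def columns_in_order(row):
    """Return (name, value) pairs sorted by column name.

    Sorting by name is an assumed stand-in for the column comparator.
    """
    return sorted(row.items())


# Each row carries only the columns that were written to it.
alice_names = [name for name, _ in columns_in_order(users["alice"])]
bob_names = [name for name, _ in columns_in_order(users["bob"])]
```

Here the "alice" row has no "twitter" or "zip" column at all; nothing is stored as a null placeholder.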
Look familiar?
Arise because of distribution model, not CF model.
* Atomic at the CF row level. Not isolated.
* Large transactional apps push work down to the node (shared nothing).
* Guaranteeing ACID constraints across nodes is a hard problem.
OTOH, you do get a lot of things:
Data redundancy
Very fast writes, fast reads
Relational > formally defined > correct
Query first > not formally defined > somehow incorrect
You get some things in exchange:
Scalability
Availability
Replication
Focus on query & analysis.
B+trees
Update once
* Cassandra typically becomes IO bound before becoming CPU bound.
Not set in stone. Your application may require a different approach.
Recognize non-starters: Is my dataset going to become Very Large? Will I need to sustain high write throughput?
Also, what are the common operations? Optimize CFs for those operations.
* Columns are sorted. Choose keys and columns.
You need to think about how you plan to slice your data.
Related data is kept close together to reduce IO.
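A rough sketch of why sorted columns matter for slicing. The `Row` class below is hypothetical and lives entirely in memory; it is not a Cassandra client call. It assumes column names are integer timestamps and that a slice is inclusive at both ends, and it shows that a range of adjacent columns can be read as one contiguous run once the names are kept in order.

```python
from bisect import bisect_left, bisect_right


class Row:
    """Toy row that keeps its column names sorted (illustration only)."""

    def __init__(self):
        self._names = []    # column names, always kept in sorted order
        self._values = {}   # column name -> value

    def insert(self, name, value):
        # New names are placed at their sorted position; rewrites just
        # replace the stored value.
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, start, finish):
        # Binary search for both ends, then read one contiguous run.
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Hypothetical sensor readings, written out of order.
row = Row()
for ts, reading in [(1005, 21.5), (1001, 20.9), (1010, 22.0), (1003, 21.1)]:
    row.insert(ts, reading)

window = row.slice(1002, 1005)  # columns with 1002 <= name <= 1005
```

Choosing the timestamp as the column name is what makes the time-window query a single slice; a different common query would suggest a different choice of key and column name.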
Denormalize.
Use the disk.
Don’t be afraid to create another CF that duplicates some data.
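One way to picture "create another CF that duplicates some data" is a second structure written at the same time as the first and keyed for a different query. The sketch below uses plain dicts; the `users` and `users_by_city` names and the `add_user` helper are assumptions made for illustration, not part of any real schema.

```python
# Two toy column families holding overlapping data (illustration only).
users = {}          # row key: user id -> {column name: value}
users_by_city = {}  # row key: city    -> {user id: display name}


def add_user(user_id, name, city):
    """Write the user once per query pattern instead of joining at read time."""
    users[user_id] = {"name": name, "city": city}
    # Duplicate the name into a row keyed by city so "who lives in X?"
    # becomes a single-row read.
    users_by_city.setdefault(city, {})[user_id] = name


add_user("u1", "Alice", "Austin")
add_user("u2", "Bob", "Austin")
add_user("u3", "Carol", "Boston")

austin = sorted(users_by_city["Austin"].items())
```

The extra write costs disk space and a second insertion, which is the trade the notes above recommend accepting.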
Composite column names
Painful updates of denormalized parts
Fast reads & insertions
Key
Normal attributes
Composite column names.
Pulling in relationships.
Painful updates. Denormalization is best when data doesn’t change.
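A small sketch of both halves of that trade-off, under assumed names. Column names here are `(timestamp, post_id)` tuples, a stand-in for composite column names, and the author's display name is copied into every timeline entry. The `timelines` structure and the `deliver`, `read_timeline` and `rename_author` helpers are hypothetical and run in memory only.

```python
# Toy timeline CF: row key is the follower, column name is a composite
# (timestamp, post_id), and the value carries denormalized author data.
timelines = {}


def deliver(follower, timestamp, post_id, author_name, body):
    """Insert one entry; a cheap write keyed by the composite name."""
    row = timelines.setdefault(follower, {})
    row[(timestamp, post_id)] = {"author": author_name, "body": body}


def read_timeline(follower):
    """Read one row, ordered by timestamp first, then post id."""
    return sorted(timelines.get(follower, {}).items())


def rename_author(old_name, new_name):
    """The painful part: every copied value has to be found and rewritten."""
    touched = 0
    for row in timelines.values():
        for column in row.values():
            if column["author"] == old_name:
                column["author"] = new_name
                touched += 1
    return touched


deliver("carol", 200, "p2", "bob", "second post")
deliver("carol", 100, "p1", "alice", "first post")
deliver("dave", 100, "p1", "alice", "first post")

order = [name for name, _ in read_timeline("carol")]
touched = rename_author("alice", "alicia")
```

Reading a timeline touches one row, while the rename has to scan every row that holds a copy, which is why this layout suits data that rarely changes.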