Cassandra: writes
[Diagram: Client → THRIFT → Row Mutation sent to a ring of nodes (tokens 00, 64, 128, 192); Hint Storage]
• Tunable consistency level: CL = ZERO, CL = ONE, CL = QUORUM, CL = ALL
• Error if # of writes < CL
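
Cassandra: reads
[Diagram: Client reads from a ring of nodes (tokens 00, 64, 128, 192); replies are labelled Data, Hash and Wrong Hash; the resolved result is returned to the client; Background Read Repair]
• Tunable consistency: CL = ONE, CL = QUORUM, CL = ALL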

Cassandra: data model
[UML diagram]
• Keyspace (“Database”) 1..* ColumnFamilyStore (“Table”)
• ColumnFamilyStore 0..* ColumnFamily (“Row”), << dynamic >>; key : String
• ColumnFamily 0..* Column, << dynamic >>; name : byte[], value : byte[], timestamp : long

Cassandra: local write(s)
[Diagram: Write (Key, Column) with name, value, ts goes to the Commit Log and the Memtable; the Flusher Thread flushes the Memtable into SSTable 1 .. SSTable 4; the Compaction Thread merge-sorts them into SSTable 5]
• Writes are always sequential!
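
The write path above can be sketched as a toy model. This is only an illustration under stated assumptions: the class `LocalStore`, its `flush_threshold` parameter and the in-memory lists standing in for files are hypothetical, not Cassandra's actual classes.

```python
class LocalStore:
    """Toy model of the local write path: commit log, memtable, SSTables."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only, so writes are sequential
        self.memtable = {}     # (key, column) -> (value, ts)
        self.sstables = []     # each SSTable is an immutable sorted list
        self.flush_threshold = flush_threshold  # hypothetical flush trigger

    def write(self, key, column, value, ts):
        self.commit_log.append((key, column, value, ts))
        current = self.memtable.get((key, column))
        if current is None or ts >= current[1]:
            self.memtable[(key, column)] = (value, ts)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Flusher thread: dump the memtable as one sorted SSTable."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def compact(self):
        """Compaction thread: merge all SSTables, keeping the newest cell."""
        merged = {}
        for table in self.sstables:
            for cell, (value, ts) in table:
                if cell not in merged or ts > merged[cell][1]:
                    merged[cell] = (value, ts)
        self.sstables = [sorted(merged.items())]


store = LocalStore(flush_threshold=2)
store.write("k1", "a", 1, ts=1)
store.write("k1", "b", 2, ts=2)   # memtable full: flushed as the first SSTable
store.write("k1", "a", 3, ts=3)
store.write("k2", "a", 4, ts=4)   # flushed as the second SSTable
store.compact()                   # both SSTables merged into one
```

Nothing is ever updated in place: the commit log is appended to, and SSTables are only written whole by a flush or a compaction.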

Cassandra: local read(s)
[Diagram: (part of) data, as name / value / timestamp columns, is read from the Memtable, SSTable 1, SSTable 4 and SSTable 5, then resolved]
• get(Key, columnNames)
• slice(Key, from, count, direction)
• key_range(from, count)

SSTable anatomy
• SSTable-5-Filter.db: Bloom Filter (“Row May Exist”, False Positives)
• SSTable-5-Index.db: Key => Offset
• SSTable-5-Data.db: Row = Per Row Bloom Filter, Per Row Index, Columns
So:
– ZERO reads, if row is missing AND you’re lucky
– 1 seek and read of index, if row is missing AND you’re not
– 2 logical seeks per read (for small rows)
– 3 logical seeks per read (for large ones)
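
Why a missing row can cost zero reads is easier to see with a minimal Bloom filter sketch. Assumptions: the class name, the sizes and the use of SHA-256 as the hash are illustrative stand-ins, not what Cassandra's *-Filter.db actually uses.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely absent": no disk access is needed.
        # True means "row may exist": go on to the index and data files.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


row_filter = BloomFilter()
row_filter.add(b"9999999999")
present = row_filter.might_contain(b"9999999999")   # always True once added
```

The "lucky" case on the slide is the filter answering False; the "not lucky" case is a false positive, which costs one index read before the row turns out to be missing.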

Conflict Resolution
• SSTable “AccStatements-3456”: RowKey = “Oleg_Anastasyev”, ColumnFamily = “AccountStatements”, Column = “LV05HABA95142357516”, Value = $1,000,000
vs
• Memtable: RowKey = “Oleg_Anastasyev”, ColumnFamily = “AccountStatements”, Column = “LV05HABA95142357516”, Value = $10
Which data is correct?

Conflict Resolution
• SSTable “AccStatements-3456”: RowKey = “Oleg_Anastasyev”, ColumnFamily = “AccountStatements”, Column = “LV05HABA95142357516”, Value = $1,000,000, Timestamp = 1.1.2012 00:00:00
vs
• Memtable: RowKey = “Oleg_Anastasyev”, ColumnFamily = “AccountStatements”, Column = “LV05HABA95142357516”, Value = $10, Timestamp = 2.2.2011 00:00:00
The one with the latest timestamp. Period.
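
The "latest timestamp wins" rule fits in a few lines. A minimal sketch, assuming the first timestamp on the slide (1.1.2012) belongs to the SSTable copy and the second (2.2.2011) to the Memtable copy; `Column` and `resolve` are illustrative names.

```python
from datetime import datetime
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: datetime


def resolve(*versions: Column) -> Column:
    """Pick the version with the latest timestamp; where it was stored is irrelevant."""
    return max(versions, key=lambda column: column.timestamp)


sstable_version = Column("LV05HABA95142357516", "$1,000,000", datetime(2012, 1, 1))
memtable_version = Column("LV05HABA95142357516", "$10", datetime(2011, 2, 2))

winner = resolve(sstable_version, memtable_version)
```

Under that assumption the older on-disk copy wins over the in-memory one, because only the timestamp is compared.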

Missed Update problem
Client 1:
1. Read AccountStatement for Key=”Oleg” (got $10, TS=12:00:00)
2. Deposit $1,000,000
3. Save Key=”Oleg”, Value=$1,000,010, TS=12:00:01.00
Client 2:
1. Read AccountStatement for Key=”Oleg” (got $10, TS=12:00:00)
2. Withdraw $1
3. Save Key=”Oleg”, Value=$9, TS=12:00:01.005
• Distributed counters (since 0.7)!
• No generic solution for Read-Modify-Update!
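
The scenario can be replayed against a toy last-write-wins store. This is a sketch only: `store`, `save` and `read` are hypothetical helpers, and timestamps are written as seconds after 12:00:00.

```python
store = {}  # key -> (value, timestamp)


def save(key, value, ts):
    """Last write wins: keep the value carrying the newest timestamp."""
    current = store.get(key)
    if current is None or ts > current[1]:
        store[key] = (value, ts)


def read(key):
    return store[key][0]


save("Oleg", 10, ts=0.0)

balance_seen_by_1 = read("Oleg")   # client 1 reads $10
balance_seen_by_2 = read("Oleg")   # client 2 reads $10

save("Oleg", balance_seen_by_1 + 1_000_000, ts=1.000)   # deposit
save("Oleg", balance_seen_by_2 - 1, ts=1.005)           # withdrawal, 5 ms later

final_balance = read("Oleg")   # 9: the deposit was silently overwritten
```

No error is raised anywhere, which is why the slide lists "no conflict detection" as a drawback.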

Intro Summary
Pros:
• No SPoF
• Focus on Availability
• High and Stable Write performance
• On the fly cluster expansion
• Very efficient missing row read
• No locks
• No backups necessary
• Instant Multi DC
Cons:
• No ACID, No rollbacks
• No conflict detection
• No locks
• NoSQL => Think ahead for queries

Ok. Here is the problem
• PhotoMarks table
  PhotoId:long   UserId:long   OwnerId:long   Mark:float   timestamp
  9999999999     1111          2222           1.0f         11:00
  – 2 billion shows a day (~ 50 000 on peak second)
  – 100 M new marks on 10 M new photos a day (~ 2500 a second)
  – 1.5 TB of data + 800 GB of indexes, growing by 3 GB per day
• Query patterns
  – EXISTS ( PhotoId=?, UserId=? ) - for 98% of calls the answer is “NOT EXISTS”
  – SUM(), AVG() on PhotoId=? -- what are the totals on a photo?
  – OwnerId=? ORDER BY Timestamp DESC -- who marked my photos?
  – COUNT(*) OwnerId=? AND Timestamp>? -- how many new marks are there?
  – UserId=? -- for cleanup

And the trouble is
– 32 SQL clusters are serving it (not counting standbys)
– ... and they are close to their capacity in CPU and disk array IOPS

Simple solutions?
• Add more SQL nodes
  – There are already 32 of them; add 32 more => 64
  – Expensive (hardware + software)
  – Extension is offline and a lot of manual work
  – Repeat in half a year ( => 128 => 256 )
• Memory cache
  – The high share of NOT EXISTS queries renders an LRU cache useless
  – So you have to cache 100% of rows
  – 1.5 TB of RAM is not cheap
  – and you need 2-3x more for fault tolerance and queries

Cassandra!
• Leveraging Pros
  – Cheap NOT EXISTS (bloom filter powered)
  – CFs are enough to model this
  – Data are stored on disk together (most queries are 2 disk reads)
  – Cold dataset is stored on disk
  – On the fly expansion of cluster
  – Strong Availability
• Not hitting Cons
  – No ACID requirements
  – Eventual Consistency is ok
  – Marks are rarely changed and never changed concurrently
  – Have time for major compaction

Data model
• MarksByPhoto
• MarksByOwner (FreshFirst order)
• MarksUserIndex

Data model: MarksByPhoto
  Key              Column          Column Value
  photoId:String   userId:byte[]   ownerUserId+mark+time : byte[]
– SUM(), AVG() on PhotoId=?
– EXISTS ( PhotoId=?, UserId=? ): 98% of calls => “NOT EXISTS”
- We want no disk reads here, Cap!
- But Cassandra needs to read columns...
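
How this layout answers the two query patterns can be shown with plain dictionaries. A minimal sketch: the nested dict stands in for the column family, the tuple for the packed column value, and the function names are hypothetical.

```python
# Row key = photoId, column name = userId,
# column value = (ownerUserId, mark, time) packed together.
marks_by_photo = {
    "9999999999": {
        1111: (2222, 1.0, "11:00"),
    },
}


def exists(photo_id, user_id):
    """EXISTS (PhotoId=?, UserId=?): a single column lookup inside one row."""
    return user_id in marks_by_photo.get(photo_id, {})


def totals(photo_id):
    """SUM() and AVG() on PhotoId=?: one pass over the columns of one row."""
    marks = [mark for (_owner, mark, _time) in marks_by_photo.get(photo_id, {}).values()]
    total = sum(marks)
    average = total / len(marks) if marks else 0.0
    return total, average


already_marked = exists("9999999999", 1111)
not_marked = exists("9999999999", 3333)
photo_totals = totals("9999999999")
```

Both queries touch exactly one row, which is what lets most of them complete in the two disk reads mentioned earlier.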

#1: Column Bloom Filter
• How?
  – Stores (Key, Column name) pairs in SSTable *-Filter.db
• Pros
  – No disk access on NOT EXISTS
  – ... i.e. 98% of operations are memory only
  – Larger bloom filter => fewer false positives
• Cons
  – Bloom filters became large: 100s of MBytes
  – ... which led to GC Promotion Failures (originally the filter was implemented as a single long[] array)
  – Fixed it (CASSANDRA-2466)
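
The trade-off between filter size and false positives follows the standard Bloom filter sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A small sketch; the pair count and the target false-positive rate below are hypothetical, not figures from the talk.

```python
import math


def bloom_size_bits(num_items: int, false_positive_rate: float) -> int:
    """Bits needed to hold num_items at the given false-positive rate."""
    return math.ceil(-num_items * math.log(false_positive_rate) / (math.log(2) ** 2))


def optimal_hash_count(num_bits: int, num_items: int) -> int:
    """Number of hash functions that minimises false positives."""
    return max(1, round(num_bits / num_items * math.log(2)))


pairs = 100_000_000        # hypothetical number of (Key, Column name) pairs
target_rate = 0.01         # hypothetical 1% false positives

bits = bloom_size_bits(pairs, target_rate)
hashes = optimal_hash_count(bits, pairs)
size_mb = bits / 8 / 1e6   # roughly 120 MB for these inputs
```

At about 9.6 bits per stored pair, indexing every column rather than every row quickly reaches the hundreds of megabytes mentioned above.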

Profit!
• New system
  – 8 Cassandra nodes (instead of 32 SQLs)
  – RF = 2
  – Every row stored 2.5 times (in MarksByPhoto, ByOwner and UserIndex CFs)
  – ~ 10 TB of data
  – which is > 1 TB per every single node
• :-( still
  – A lot of data per node => no use for SSD
  – Major compaction yields 400 - 500 GB SSTable files
  – ... which are unmanageable
  – ... lead to full disk cache invalidation as soon as compaction finishes
  – ... cannot be split across disks (RAID 10 is SLOWER)
  – ... but we need more spindles

#2: Split 'em!
• How?
  – Pre-partition every CF into 256 smaller CFs, based on the key's low byte, e.g. MarksByPhoto_00, _01, ..., _FF
  – Now every node has 32*RF(2)=64 smaller CFs of 5-10 GB each
  – Made memtable flush every 2 MB
  – Made 10 separate data volumes + RAID 10 for the commit log device
• Pros
  – Even load distribution across disks (and fixed it to be plain round-robin)
  – Cheap spinning disks
  – No massive disk cache invalidation on major compaction
  – 2 MB flushes do not stress disks
  – More parallelization abilities (e.g. we reduced startup time from 30 min -> 3)
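
Routing a key to one of the 256 smaller CFs can be sketched as below. Assumptions: the key is treated as an integer id and its lowest byte picks the suffix; the function name is hypothetical and the actual implementation may derive the byte differently.

```python
def partitioned_cf(base_name: str, key: int) -> str:
    """Name of the sub-CF for a key, chosen by the key's low byte (00..FF)."""
    return f"{base_name}_{key & 0xFF:02X}"


cf_for_photo = partitioned_cf("MarksByPhoto", 9999999999)   # low byte is 0xFF
cf_for_256 = partitioned_cf("MarksByPhoto", 256)            # low byte is 0x00

# 256 sub-CFs spread over 8 nodes with RF = 2 gives the 64 CFs per node.
cfs_per_node = 256 // 8 * 2
```

Because the suffix depends only on the key, a reader can compute which small CF to open without any lookup table.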

What happens?
• On node failure
  – Additional load on live node (THRIFT CPU, Hint storage)
  – 300+ clients want to establish 20-30 connections each within 1 second (and thrift is slow on this)
  – ... Clients decide the secondary replica is dead as well
  – ... No availability on part of keys ;-(
• Inconvenient
  – You need to make sure max pool sizes are 2x as large as necessary
  – Ensure each node's capacity is > 2x

Own replication
• How does it work?
  – See source code at github/odnoklassniki
  – Look for RackAwareOdklEvenStrategy
• What does it do?
  – All keys of every single node are replicated across ALL nodes from other DCs (and not only on its ring neighbors)
• Profit!
  – Every single node failure adds a tiny fraction of load to the others

Some problems left
[Diagram labels: application servers, PhotoMarksDAOImpl, THRIFT + hector, cassandra cluster]
• THRIFT is slow, esp. on connect
  – even the newest one from cassandra 1.1 is slower on communication and ser/deser than our BLOS
  – employs funny IDL (see slideshare.net/m0nstermind/java-13078132)
• DTO <-> THRIFT <-> CS translation
• Multiple roundtrips per transaction

Meet odnoklassniki-like
[Diagram labels: application servers, PhotoMarksDAOImpl, BLOS, LikeS + hector, cassandra cluster]
• All BL is co-located on CS JVM => far fewer network roundtrips; lower partial failure probability; DTO <-> CS direct translation
• App specialized caches => faster, less RAM, store DTO, own policy
• CF Data listeners => custom replicas merge logic
• Row processors => (custom) bulk data processing
• Direct access to in-memory bloom filters => fast “friends like it too”

That’s it! Thank you!
Oleg Anastasyev
firstname.lastname@example.org
odnoklassniki.ru/oa
github.com/odnoklassniki/apache-cassandra
cassandra.apache.org

#3: Parallel read
[Diagram: ring of nodes (tokens 00, 64, 128, 192); replies are labelled Data, Slow Data and Wrong Data; Background Read Repair]
ParallelReadResolver:
1. asks all replicas for data
2. waits for the number of responses the CL requires
3. resolves and returns to client
4. waits for the rest in a consistency thread
Pros:
• Single node problems go unnoticed
• Response time stability
Cons:
• More traffic (negligible for small data)
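
Steps 1 to 3 can be approximated with a thread pool. This is a sketch under assumptions, not the ParallelReadResolver itself: replicas are modelled as plain functions returning a (value, timestamp) pair, and the background consistency check of step 4 is left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_read(replicas, key, consistency_level):
    """Ask every replica at once and answer after the first CL responses."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, key) for replica in replicas]   # step 1
    responses = []
    for future in as_completed(futures):                            # step 2
        responses.append(future.result())
        if len(responses) >= consistency_level:
            break
    pool.shutdown(wait=False)   # slow replicas finish on their own later
    return max(responses, key=lambda response: response[1])         # step 3


def fast_replica(key):
    return ("$10", 2)


def slow_replica(key):
    time.sleep(0.3)             # simulates a node with a temporary problem
    return ("$9", 1)


result = parallel_read([fast_replica, slow_replica, fast_replica], "Oleg",
                       consistency_level=2)
```

The caller gets its answer as soon as two replicas reply, so one slow node does not show up in the response time.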

Data model: MarksByOwner (FreshFirst order)
  Key              Column name                     Column Value
  ownerId:String   time+photoId+userId : byte[]    mark : byte[]
                   time+photoId+userId : byte[]    mark : byte[]
                   ...                             ...
– OwnerId=? ORDER BY Timestamp DESC -- who marked my photos?
– COUNT(*) OwnerId=? AND Timestamp>? -- how many new marks are there?
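
Keeping the columns of one owner row sorted newest-first makes both queries a prefix scan. A minimal sketch: the class and method names are hypothetical, time is assumed to be an integer, and "fresh first" is emulated by sorting on the negated time.

```python
import bisect


class MarksByOwnerRow:
    """One owner's row: columns named time+photoId+userId, newest first."""

    def __init__(self):
        # Sorted list of ((-time, photo_id, user_id), mark).
        self._columns = []

    def add(self, time, photo_id, user_id, mark):
        bisect.insort(self._columns, ((-time, photo_id, user_id), mark))

    def latest(self, count):
        """OwnerId=? ORDER BY Timestamp DESC: read the first `count` columns."""
        return [(-name[0], name[1], name[2], mark)
                for name, mark in self._columns[:count]]

    def count_newer_than(self, time):
        """COUNT(*) ... AND Timestamp>?: length of the prefix newer than `time`."""
        # The one-element probe sorts before every column with the same time,
        # so the insertion point equals the number of strictly newer columns.
        return bisect.bisect_left(self._columns, ((-time,),))


row = MarksByOwnerRow()
row.add(100, photo_id=1, user_id=11, mark=5.0)
row.add(300, photo_id=2, user_id=12, mark=4.0)
row.add(200, photo_id=1, user_id=13, mark=3.0)

newest_two = row.latest(2)
new_since_100 = row.count_newer_than(100)
```

Putting the time first in the column name is what turns "who marked my photos" into a sequential read from the start of the row.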