Being Closer to Cassandra

Oleg Anastasyev
lead platform developer
Odnoklassniki.ru
Top 10 of the world's social networks
40M DAU, 80M MAU, 7M peak
~ 300 000 www req/sec,
20 ms render latency
>240 Gbit/s out
> 5 800 bare-metal servers in 5 DCs
99.9% Java

#CASSANDRAEU

* Odnoklassniki means “classmates” in English
Cassandra @ Odnoklassniki
* Since 2010

- branched from 0.6
- aiming at:
  full operation on DC failure, scalability,
  ease of operations

* Now

- 23 clusters
- 418 nodes in total
- 240 TB of stored data
- survived several DC failures

#CASSANDRAEU
Case #1. The fast

#CASSANDRAEU
[Like! widget screenshot: “Like! 103 927”]

#CASSANDRAEU

[Like! widget screenshot: “You and 103 927”]
Like! widget
* It's everywhere

- On every page, often a dozen per page
- On feeds (AKA timeline)
- On 3rd-party websites elsewhere on the internet

* It's on everything

- Pictures and Albums
- Videos
- Posts and comments
- 3rd party shared URLs

#CASSANDRAEU

Like! widget
* High load

- 1 000 000 reads/sec, 3 000 writes/sec

* Hard load profile

- Read-mostly
- Long tail (40% of reads are random)
- Sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes for ~6 billion entities

#CASSANDRAEU
Classic solution
SQL table:

  RefId:long  | RefType:byte | UserId:long | Created
  9999999999  | PICTURE(2)   | 11111111111 | 11:00

to render “You and 4256”:

  SELECT TOP 1 WHERE RefId,RefType,UserId = ?,?,?                  = N >= 1   (98% are NONE)
  SELECT COUNT(*) WHERE RefId,RefType = ?,?                        = M > N    (80% are 0)
  SELECT TOP N * WHERE RefId,RefType = ? AND IsFriend(?,UserId)    = N*140

#CASSANDRAEU
Cassandra solution
LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId), userId )
)

LikeCount (
  refType byte,
  refId bigint,
  likers counter,
  PRIMARY KEY ( (refType, refId) )
)

so, to render “You and 4256”:

  SELECT FROM LikeCount WHERE refType,refId = ?,?               (80% are 0)
  SELECT * FROM LikeByRef WHERE refType,refId,userId = ?,?,?    (98% are NONE)

  = N*20%
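The render flow over these two tables can be read as the minimal Java sketch below (not from the talk): LikeCountTable and LikeByRefTable are hypothetical accessors standing in for whatever data-access layer fronts the two column families.

// A sketch of rendering the Like! widget from LikeCount and LikeByRef.
// Most renders stop early: 80% of entities have no likes, and 98% of the
// time the current viewer has not liked the entity.
final class LikeWidget {
    interface LikeCountTable { long likers(byte refType, long refId); }
    interface LikeByRefTable { boolean exists(byte refType, long refId, long userId); }

    record Summary(long count, boolean likedByViewer) {}

    static Summary render(byte refType, long refId, long viewerId,
                          LikeCountTable counts, LikeByRefTable likes) {
        long count = counts.likers(refType, refId);
        if (count == 0) {
            return new Summary(0, false);           // 80% of entities: nothing to show
        }
        boolean liked = likes.exists(refType, refId, viewerId);   // 98% are NONE
        return new Summary(count, liked);
    }
}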
>11M iops
* Quick workaround?

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId, userId) )
)

SELECT TOP N * WHERE refType,refId = ? AND IsFriend(?, userId)

- Forces OrderPreservingPartitioner
  (RandomPartitioner cannot serve such range scans)
- Key range scans
- More network overhead
- Partition count >10x, dataset size >2x
#CASSANDRAEU
By column bloom filter
* What it does

- Includes pairs of (PartitionKey, ColumnKey) in
  the SSTable *-Filter.db

* The good

- Eliminates 98% of reads
- Fewer false positives

* The bad

- Filters become too large
  GC promotion failures
  .. but fixable (CASSANDRA-2466)

#CASSANDRAEU
Are we there yet?

[Diagram: application servers calling cassandra — 1. COUNT(), 2. EXISTS per render]

- min 2 roundtrips per render (COUNT+RR)
- Thrift is slow, especially with a lot of connections
- EXISTS() is 200 Gbit/sec (140*8*1Mps*20%)
#CASSANDRAEU
Co-locate!

[Diagram: odnoklassniki-like service co-located with cassandra —
 Remote Business Intf exposing get() : LikeSummary, backed by
 Counters Cache and Social Graph Cache]

- one-nio remoting (faster than java nio)
- topology-aware clients
#CASSANDRAEU
co-location wins
* Fast TOP N friend likers query (see the sketch below)

1. Take friends from the graph cache
2. Check them against the in-memory bloom filter
3. Read candidates until N friend likers are found

* Custom caches

- Tuned for application
* Custom data merge logic
- ... so you can detect and resolve conflicts
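A minimal Java sketch of the friend-likers flow from the list above, assuming hypothetical SocialGraphCache, LikeBloomFilter and LikeByRefStore interfaces in place of the real co-located components:

import java.util.ArrayList;
import java.util.List;

final class FriendLikers {
    interface SocialGraphCache { List<Long> friendsOf(long userId); }
    interface LikeBloomFilter  { boolean mightContain(byte refType, long refId, long userId); }
    interface LikeByRefStore   { boolean exists(byte refType, long refId, long userId); }

    static List<Long> topNFriendLikers(long viewerId, byte refType, long refId, int n,
                                       SocialGraphCache graph,
                                       LikeBloomFilter bloom,
                                       LikeByRefStore store) {
        List<Long> result = new ArrayList<>(n);
        // 1. Take friends from the co-located social graph cache
        for (long friendId : graph.friendsOf(viewerId)) {
            // 2. Cheap in-memory bloom filter check first: most friends did not like
            if (!bloom.mightContain(refType, refId, friendId)) continue;
            // 3. Confirm with an actual read; stop once N friend likers are found
            if (store.exists(refType, refId, friendId)) {
                result.add(friendId);
                if (result.size() >= n) break;
            }
        }
        return result;
    }
}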
#CASSANDRAEU
Listen for mutations
// Implement it
interface StoreApplyListener {
    boolean preapply(String key,
                     ColumnFamily data);
}

// and register with CFS
store = Table.open(..)
             .getColumnFamilyStore(..);
store.setListener(myListener);

* Register it

  between commit log replay and gossip startup

* Hooked into RowMutation.apply()

  so it sees not only the original mutation
  but also replica writes, hints and ReadRepairs
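For illustration, a hedged sketch of what an implementation of this hook might look like when used to keep a counters cache in sync; only the StoreApplyListener interface comes from the slide above, while CountersCache and its applyMutation method are hypothetical:

final class LikeCountListener implements StoreApplyListener {
    // hypothetical off-heap counter cache (see the next slide)
    private final CountersCache cache;

    LikeCountListener(CountersCache cache) { this.cache = cache; }

    @Override
    public boolean preapply(String key, ColumnFamily data) {
        // key encodes (refType, refId); a mutation on it means the cached
        // counter may be stale, so update (or invalidate) the cached entry
        cache.applyMutation(key, data);
        return true;   // let Cassandra apply the mutation as usual
    }
}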

#CASSANDRAEU
Like! optimized counters
* Counters cache

- Off-heap (sun.misc.Unsafe)
- Compact (30M counters in 1 GB of RAM)
- Reads cached counters from the local node only
  (fewer reads, long-tail aware)

* Replicated cache state

- solves the cold replica cache problem
  by making (NOP) mutations

LikeCount (
  refType byte,
  refId bigint,
  ip inet,
  counter int,
  PRIMARY KEY ( (refType, refId), ip )
)
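A minimal sketch of the off-heap idea (sun.misc.Unsafe, a fixed 16 bytes per counter); the single-probe table and sizing here are simplifications for illustration, not the production cache:

import sun.misc.Unsafe;
import java.lang.reflect.Field;

final class OffHeapCounters {
    private static final Unsafe U = unsafe();
    private static final int ENTRY = 16;          // 8-byte key + 8-byte counter
    private final long base;
    private final int slots;

    OffHeapCounters(int slots) {
        this.slots = slots;
        this.base = U.allocateMemory((long) slots * ENTRY);
        U.setMemory(base, (long) slots * ENTRY, (byte) 0);
    }

    void increment(long key) {
        long addr = slot(key);
        U.putLong(addr, key);
        U.putLong(addr + 8, U.getLong(addr + 8) + 1);
    }

    long get(long key) {
        long addr = slot(key);
        return U.getLong(addr) == key ? U.getLong(addr + 8) : 0L;
    }

    void free() { U.freeMemory(base); }

    // single-probe "hash table": real code would handle collisions and resizing
    private long slot(long key) {
        int idx = (int) Math.floorMod(key * 0x9E3779B97F4A7C15L, (long) slots);
        return base + (long) idx * ENTRY;
    }

    private static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}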
Read latency variations
* Cassandra read behavior

1. Choose 1 node for data and N for digests
2. Wait for the data and the digests
3. Compare and return (or read-repair)

* Nodes suddenly slow down

- SEDA hiccup, commit log rotation, sudden IO
  saturation, network hiccup or partition, page
  cache miss

* The bad

- You have latency spikes
- You have to wait (and time out)

#CASSANDRAEU
Read Latency leveling
* “Parallel” read handler (see the sketch below)

1. Ask all replicas for data in parallel
2. Wait for CL responses and return

* The good

- Minimal-latency responses
- Constant load when a DC fails

* The (not so) bad

- “Additional” work and traffic
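A minimal sketch of such a parallel read path, assuming a hypothetical ReplicaClient per replica; real code would also merge the responses by timestamp and cancel outstanding requests:

import java.util.List;
import java.util.concurrent.*;

final class ParallelReader {
    interface ReplicaClient { byte[] read(String key) throws Exception; }

    static byte[] read(String key, List<ReplicaClient> replicas, int consistencyLevel,
                       ExecutorService pool, long timeoutMs) throws Exception {
        CompletionService<byte[]> done = new ExecutorCompletionService<>(pool);
        for (ReplicaClient replica : replicas) {
            done.submit(() -> replica.read(key));          // ask every replica in parallel
        }
        byte[] freshest = null;
        for (int received = 0; received < consistencyLevel; received++) {
            Future<byte[]> f = done.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (f == null) throw new TimeoutException("not enough replicas answered");
            byte[] value = f.get();                        // first CL responses win
            if (freshest == null) freshest = value;        // real code would merge by timestamp
        }
        return freshest;
    }
}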

#CASSANDRAEU
More tiny tricks
* On SSD IO

- Deadline IO elevator
- 64k -> 4k read request size

* HintLog

- A commit log for hints
- Wait for all hints on startup

* Selective compaction

- Compacts the most-read CFs more often

#CASSANDRAEU
Case #2. The fat

#CASSANDRAEU
* Messages in chats

- The last page is accessed on open
- Long tail (80%) for the rest

- 150 billion messages, 100 TB in storage
- Read-mostly (120k reads/sec, 8k writes/sec)
#CASSANDRAEU
Messages have structure
Message (
  chatId, msgId,
  created, type, userIndex, deletedBy, ...
  text
)

MessageCF (
  chatId, msgId,
  data blob,
  PRIMARY KEY ( chatId, msgId )
)

- All chat’s messages in a single partition
- Single blob for message data
  to reduce overhead

- The bad
  Conflicting modifications can happen
  (users, anti-spam, etc.)

#CASSANDRAEU
LW conflict resolution
Messages (
  chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY ( chatId, msgId, version )
)

[Diagram: two writers both get (version:ts1, data:d1), then concurrently
 write(ts1, data2) -> delete(version:ts1); insert(version: ts2=now(), data2)
 write(ts1, data3) -> delete(version:ts1); insert(version: ts3=now(), data3)
 leaving both (ts2, data2) and (ts3, data3) stored]

- conflicting versions are merged on read
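A sketch of the write and merge-on-read steps under an assumed MessageStore interface; the real merge logic is application-specific, and newest-version-wins below only stands in for it:

import java.util.*;

final class VersionedMessages {
    record Versioned(long version, byte[] data) {}

    interface MessageStore {
        List<Versioned> readAll(long chatId, long msgId);          // all live versions
        void delete(long chatId, long msgId, long version);
        void insert(long chatId, long msgId, long version, byte[] data);
    }

    // write: supersede the version we read, insert a new one stamped with now()
    static long write(MessageStore store, long chatId, long msgId,
                      long readVersion, byte[] newData) {
        long newVersion = System.currentTimeMillis();
        store.delete(chatId, msgId, readVersion);
        store.insert(chatId, msgId, newVersion, newData);
        return newVersion;
    }

    // read: if concurrent writes left several versions, detect and resolve here
    static Versioned read(MessageStore store, long chatId, long msgId) {
        List<Versioned> versions = store.readAll(chatId, msgId);
        if (versions.isEmpty()) return null;
        // trivial resolution: newest version wins; a real merge could combine
        // non-conflicting changes from the concurrent versions instead
        return versions.stream()
                       .max(Comparator.comparingLong(Versioned::version))
                       .orElseThrow();
    }
}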
Specialized cache
* Again. Because we can

- Off-heap (Unsafe)
- Caches only the freshest chat page
- Saves its state to a local (AKA system) CF
  keys AND values
  sequential read, much faster startup

- In-memory compression
  2x more memory almost for free

#CASSANDRAEU
Disk mgmt
* 4U chassis, 24 HDDs, up to 4 TB/node

- Size-tiered compaction can produce a single 4 TB sstable file
- RAID10? LCS?

* Split each CF into 256 pieces (see the sketch below)
* The good

- Smaller, more frequent memtable flushes
- Same compaction work,
  in smaller sets

- Can distribute across disks
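One way the 256-piece split can work is to derive the piece from the row key, roughly as in this sketch (the naming scheme and hash are illustrative, not the actual implementation):

final class PiecedColumnFamily {
    private static final int PIECES = 256;

    static int pieceOf(byte[] rowKey) {
        int h = 17;
        for (byte b : rowKey) h = h * 31 + b;   // any stable hash works
        return Math.floorMod(h, PIECES);
    }

    static String pieceName(String baseCf, byte[] rowKey) {
        return baseCf + "_" + pieceOf(rowKey);  // e.g. "MessageCF_137"
    }
}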
#CASSANDRAEU
Disk Allocation Policies
* The default is

- “Take the disk with the most free space”

* So some disks end up with

- Too many read iops

* Generational policy (see the sketch below)

- Each disk holds the same number of same-generation files
  works better for HDDs
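A sketch of what a generational policy could look like, under the assumption that the allocator tracks per-directory file counts per SSTable generation:

import java.util.*;

final class GenerationalAllocator {
    private final List<String> dataDirs;
    // dataDir -> (generation -> file count)
    private final Map<String, Map<Integer, Integer>> filesPerGen = new HashMap<>();

    GenerationalAllocator(List<String> dataDirs) {
        this.dataDirs = dataDirs;
        for (String dir : dataDirs) filesPerGen.put(dir, new HashMap<>());
    }

    // pick the data directory with the fewest files of this generation,
    // so every disk ends up with roughly the same number per generation
    synchronized String allocate(int generation) {
        String best = dataDirs.get(0);
        int bestCount = Integer.MAX_VALUE;
        for (String dir : dataDirs) {
            int count = filesPerGen.get(dir).getOrDefault(generation, 0);
            if (count < bestCount) { best = dir; bestCount = count; }
        }
        filesPerGen.get(best).merge(generation, 1, Integer::sum);
        return best;
    }
}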

#CASSANDRAEU
Case #3. The ugly
feed my Frankenstein

#CASSANDRAEU
* Chats overview

- Small dataset (230 GB)
- Has a hot set, short tail (5%)
- The list reorders often
- 130k reads/s, 21k writes/s

#CASSANDRAEU
Conflicting updates
* List<Overview> is a single blob

  .. or you’ll have a lot of tombstones

* Lots of conflicts

  concurrent updates of a single column

* Needs conflict detection
* Has a merge algorithm
#CASSANDRAEU
Vector clocks
* Voldemort

- byte[] key -> byte[] value + vector clock (see the sketch below)
- Coordination logic on clients
- Pluggable storage engines

* Plugged in

- Cassandra 0.6 SSTable persistence
- Fronted by a specialized cache
  (we love caches)
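A minimal sketch of vector-clock comparison as used for conflict detection (generic, not Voldemort's actual implementation): each value carries a map of nodeId -> counter, and two clocks are concurrent when neither dominates the other.

import java.util.*;

final class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    VectorClock increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
        return this;
    }

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    Order compareTo(VectorClock other) {
        boolean less = false, greater = false;
        Set<String> nodes = new HashSet<>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String node : nodes) {
            long a = counters.getOrDefault(node, 0L);
            long b = other.counters.getOrDefault(node, 0L);
            if (a < b) less = true;
            if (a > b) greater = true;
        }
        if (less && greater) return Order.CONCURRENT;  // conflict: needs a merge
        if (less)            return Order.BEFORE;
        if (greater)         return Order.AFTER;
        return Order.EQUAL;
    }
}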

#CASSANDRAEU
Performance
* 3-node cluster, RF = 3

- Intel Xeon CPU E5506 @ 2.13 GHz
  RAM: 48 GB, 1x HDD, 1x SSD

* 8-byte key -> 1 KB value
* Results

- 75k reads/sec, 15k writes/sec

#CASSANDRAEU
Why Cassandra?
* Reusable distributed DB components

  fast persistence, gossip,
  reliable async messaging, failure detectors,
  topology, seq scans, ...

* Has structure

beyond byte[] key -> byte[] value

* Delivered promises
* Implemented in Java

#CASSANDRAEU
THANK YOU
Oleg Anastasyev
oa@odnoklassniki.ru
odnoklassniki.ru/oa
@m0nstermind

github.com/odnoklassniki
shared-memory-cache
  Java off-heap cache using shared memory

#CASSANDRAEU

one-nio
  RMI faster than java nio, with fast and
  compact automagic Java serialization

CASSANDRASUMMITEU
