Being Closer to Cassandra

Oleg Anastasyev
lead platform developer
Odnoklassniki.ru
Top 10 of the world's social networks
40M DAU, 80M MAU, 7M peak
~ 300 000 www req/sec,
20 ms render latency
>240 Gbit/s out
> 5 800 bare-metal servers in 5 DCs
99.9% Java

#CASSANDRAEU

* Odnoklassniki means “classmates” in English
Cassandra @ Odnoklassniki
* Since 2010

- branched from 0.6
- aiming at:
  full operation on DC failure, scalability,
  ease of operations

* Now

- 23 clusters
- 418 nodes in total
- 240 TB of stored data
- survived several DC failures

#CASSANDRAEU
Case #1. The fast

#CASSANDRAEU
[Like! widget screenshot: “Like! 103 927”]

#CASSANDRAEU

[Like! widget screenshot: “You and 103 927”]
Like! widget
* It's everywhere

- On every page, often a dozen per page
- On feeds (AKA timeline)
- On 3rd-party websites elsewhere on the internet

* It's on everything

- Pictures and Albums
- Videos
- Posts and comments
- 3rd party shared URLs

#CASSANDRAEU

Like! widget
* High load

- 1 000 000 reads/sec, 3 000 writes/sec

* Hard load profile

- Read-mostly
- Long tail (40% of reads are random)
- Sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes for ~6 billion entities

#CASSANDRAEU
Classic solution
SQL table:

  RefId:long  | RefType:byte | UserId:long | Created
  9999999999  | PICTURE(2)   | 11111111111 | 11:00

to render “You and 4256”:

  SELECT TOP 1 WHERE RefId,RefType,UserId = ?,?,?                  = N >= 1   (98% are NONE)
  SELECT COUNT(*) WHERE RefId,RefType = ?,?                        = M > N    (80% are 0)
  SELECT TOP N * WHERE RefId,RefType = ? AND IsFriend(?,UserId)    = N*140

#CASSANDRAEU
Cassandra solution
LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId), userId )
)

LikeCount (
  refType byte,
  refId bigint,
  likers counter,
  PRIMARY KEY ( (refType, refId) )
)

so, to render “You and 4256”:

  SELECT FROM LikeCount WHERE refType,refId = ?,?               (80% are 0)
  SELECT * FROM LikeByRef WHERE refType,refId,userId = ?,?,?    (98% are NONE)

  = N*20%
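The render flow over these two tables can be read as the minimal Java sketch below (not from the talk): LikeCountTable and LikeByRefTable are hypothetical accessors standing in for whatever data-access layer fronts the two column families.

// A sketch of rendering the Like! widget from LikeCount and LikeByRef.
// Most renders stop early: 80% of entities have no likes, and 98% of the
// time the current viewer has not liked the entity.
final class LikeWidget {
    interface LikeCountTable { long likers(byte refType, long refId); }
    interface LikeByRefTable { boolean exists(byte refType, long refId, long userId); }

    record Summary(long count, boolean likedByViewer) {}

    static Summary render(byte refType, long refId, long viewerId,
                          LikeCountTable counts, LikeByRefTable likes) {
        long count = counts.likers(refType, refId);
        if (count == 0) {
            return new Summary(0, false);           // 80% of entities: nothing to show
        }
        boolean liked = likes.exists(refType, refId, viewerId);   // 98% are NONE
        return new Summary(count, liked);
    }
}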
>11M iops
* Quick workaround?

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId, userId) )
)

SELECT TOP N * WHERE refType,refId = ? AND IsFriend(?, userId)

- Forces OrderPreservingPartitioner
  (RandomPartitioner cannot serve such range scans)
- Key range scans
- More network overhead
- Partition count >10x, dataset size >2x
#CASSANDRAEU
By column bloom filter
* What it does

- Includes pairs of (PartitionKey, ColumnKey) in
  the SSTable *-Filter.db

* The good

- Eliminates 98% of reads
- Fewer false positives

* The bad

- Filters become too large
  GC promotion failures
  .. but fixable (CASSANDRA-2466)

#CASSANDRAEU
Are we there yet?

[Diagram: application servers calling cassandra — 1. COUNT(), 2. EXISTS per render]

- min 2 roundtrips per render (COUNT+RR)
- Thrift is slow, especially with a lot of connections
- EXISTS() is 200 Gbit/sec (140*8*1Mps*20%)
#CASSANDRAEU
Co-locate!

[Diagram: odnoklassniki-like service co-located with cassandra —
 Remote Business Intf exposing get() : LikeSummary, backed by
 Counters Cache and Social Graph Cache]

- one-nio remoting (faster than java nio)
- topology-aware clients
#CASSANDRAEU
co-location wins
* Fast TOP N friend likers query (see the sketch below)

1. Take friends from the graph cache
2. Check them against the in-memory bloom filter
3. Read candidates until N friend likers are found

* Custom caches

- Tuned for application
* Custom data merge logic
- ... so you can detect and resolve conflicts
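A minimal Java sketch of the friend-likers flow from the list above, assuming hypothetical SocialGraphCache, LikeBloomFilter and LikeByRefStore interfaces in place of the real co-located components:

import java.util.ArrayList;
import java.util.List;

final class FriendLikers {
    interface SocialGraphCache { List<Long> friendsOf(long userId); }
    interface LikeBloomFilter  { boolean mightContain(byte refType, long refId, long userId); }
    interface LikeByRefStore   { boolean exists(byte refType, long refId, long userId); }

    static List<Long> topNFriendLikers(long viewerId, byte refType, long refId, int n,
                                       SocialGraphCache graph,
                                       LikeBloomFilter bloom,
                                       LikeByRefStore store) {
        List<Long> result = new ArrayList<>(n);
        // 1. Take friends from the co-located social graph cache
        for (long friendId : graph.friendsOf(viewerId)) {
            // 2. Cheap in-memory bloom filter check first: most friends did not like
            if (!bloom.mightContain(refType, refId, friendId)) continue;
            // 3. Confirm with an actual read; stop once N friend likers are found
            if (store.exists(refType, refId, friendId)) {
                result.add(friendId);
                if (result.size() >= n) break;
            }
        }
        return result;
    }
}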
#CASSANDRAEU
Listen for mutations
// Implement it
interface StoreApplyListener {
    boolean preapply(String key,
                     ColumnFamily data);
}

// and register with CFS
store = Table.open(..)
             .getColumnFamilyStore(..);
store.setListener(myListener);

* Register it

  between commit log replay and gossip startup

* Hooked into RowMutation.apply()

  so it sees not only the original mutation
  but also replica writes, hints and ReadRepairs
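For illustration, a hedged sketch of what an implementation of this hook might look like when used to keep a counters cache in sync; only the StoreApplyListener interface comes from the slide above, while CountersCache and its applyMutation method are hypothetical:

final class LikeCountListener implements StoreApplyListener {
    // hypothetical off-heap counter cache (see the next slide)
    private final CountersCache cache;

    LikeCountListener(CountersCache cache) { this.cache = cache; }

    @Override
    public boolean preapply(String key, ColumnFamily data) {
        // key encodes (refType, refId); a mutation on it means the cached
        // counter may be stale, so update (or invalidate) the cached entry
        cache.applyMutation(key, data);
        return true;   // let Cassandra apply the mutation as usual
    }
}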

#CASSANDRAEU
Like! optimized counters
* Counters cache

- Off-heap (sun.misc.Unsafe)
- Compact (30M counters in 1 GB of RAM)
- Reads cached counters from the local node only
  (fewer reads, long-tail aware)

* Replicated cache state

- solves the cold replica cache problem
  by making (NOP) mutations

LikeCount (
  refType byte,
  refId bigint,
  ip inet,
  counter int,
  PRIMARY KEY ( (refType, refId), ip )
)
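A minimal sketch of the off-heap idea (sun.misc.Unsafe, a fixed 16 bytes per counter); the single-probe table and sizing here are simplifications for illustration, not the production cache:

import sun.misc.Unsafe;
import java.lang.reflect.Field;

final class OffHeapCounters {
    private static final Unsafe U = unsafe();
    private static final int ENTRY = 16;          // 8-byte key + 8-byte counter
    private final long base;
    private final int slots;

    OffHeapCounters(int slots) {
        this.slots = slots;
        this.base = U.allocateMemory((long) slots * ENTRY);
        U.setMemory(base, (long) slots * ENTRY, (byte) 0);
    }

    void increment(long key) {
        long addr = slot(key);
        U.putLong(addr, key);
        U.putLong(addr + 8, U.getLong(addr + 8) + 1);
    }

    long get(long key) {
        long addr = slot(key);
        return U.getLong(addr) == key ? U.getLong(addr + 8) : 0L;
    }

    void free() { U.freeMemory(base); }

    // single-probe "hash table": real code would handle collisions and resizing
    private long slot(long key) {
        int idx = (int) Math.floorMod(key * 0x9E3779B97F4A7C15L, (long) slots);
        return base + (long) idx * ENTRY;
    }

    private static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}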
Read latency variations
* Cassandra read behavior

1. Choose 1 node for data and N for digests
2. Wait for the data and the digests
3. Compare and return (or read-repair)

* Nodes suddenly slow down

- SEDA hiccup, commit log rotation, sudden IO
  saturation, network hiccup or partition, page
  cache miss

* The bad

- You have latency spikes
- You have to wait (and time out)

#CASSANDRAEU
Read Latency leveling
* “Parallel” read handler (see the sketch below)

1. Ask all replicas for data in parallel
2. Wait for CL responses and return

* The good

- Minimal-latency responses
- Constant load when a DC fails

* The (not so) bad

- “Additional” work and traffic
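A minimal sketch of such a parallel read path, assuming a hypothetical ReplicaClient per replica; real code would also merge the responses by timestamp and cancel outstanding requests:

import java.util.List;
import java.util.concurrent.*;

final class ParallelReader {
    interface ReplicaClient { byte[] read(String key) throws Exception; }

    static byte[] read(String key, List<ReplicaClient> replicas, int consistencyLevel,
                       ExecutorService pool, long timeoutMs) throws Exception {
        CompletionService<byte[]> done = new ExecutorCompletionService<>(pool);
        for (ReplicaClient replica : replicas) {
            done.submit(() -> replica.read(key));          // ask every replica in parallel
        }
        byte[] freshest = null;
        for (int received = 0; received < consistencyLevel; received++) {
            Future<byte[]> f = done.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (f == null) throw new TimeoutException("not enough replicas answered");
            byte[] value = f.get();                        // first CL responses win
            if (freshest == null) freshest = value;        // real code would merge by timestamp
        }
        return freshest;
    }
}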

#CASSANDRAEU
More tiny tricks
* On SSD IO

- Deadline IO elevator
- 64k -> 4k read request size

* HintLog

- A commit log for hints
- Wait for all hints on startup

* Selective compaction

- Compacts the most-read CFs more often

#CASSANDRAEU
Case #2. The fat

#CASSANDRAEU
* Messages in chats

- The last page is accessed on open
- Long tail (80%) for the rest

- 150 billion messages, 100 TB in storage
- Read-mostly (120k reads/sec, 8k writes/sec)
#CASSANDRAEU
Messages have structure
Message (
  chatId, msgId,
  created, type, userIndex, deletedBy, ...
  text
)

MessageCF (
  chatId, msgId,
  data blob,
  PRIMARY KEY ( chatId, msgId )
)

- All chat’s messages in a single partition
- Single blob for message data
  to reduce overhead

- The bad
  Conflicting modifications can happen
  (users, anti-spam, etc.)

#CASSANDRAEU
LW conflict resolution
Messages (
  chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY ( chatId, msgId, version )
)

[Diagram: two writers both get (version:ts1, data:d1), then concurrently
 write(ts1, data2) -> delete(version:ts1); insert(version: ts2=now(), data2)
 write(ts1, data3) -> delete(version:ts1); insert(version: ts3=now(), data3)
 leaving both (ts2, data2) and (ts3, data3) stored]

- conflicting versions are merged on read
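A sketch of the write and merge-on-read steps under an assumed MessageStore interface; the real merge logic is application-specific, and newest-version-wins below only stands in for it:

import java.util.*;

final class VersionedMessages {
    record Versioned(long version, byte[] data) {}

    interface MessageStore {
        List<Versioned> readAll(long chatId, long msgId);          // all live versions
        void delete(long chatId, long msgId, long version);
        void insert(long chatId, long msgId, long version, byte[] data);
    }

    // write: supersede the version we read, insert a new one stamped with now()
    static long write(MessageStore store, long chatId, long msgId,
                      long readVersion, byte[] newData) {
        long newVersion = System.currentTimeMillis();
        store.delete(chatId, msgId, readVersion);
        store.insert(chatId, msgId, newVersion, newData);
        return newVersion;
    }

    // read: if concurrent writes left several versions, detect and resolve here
    static Versioned read(MessageStore store, long chatId, long msgId) {
        List<Versioned> versions = store.readAll(chatId, msgId);
        if (versions.isEmpty()) return null;
        // trivial resolution: newest version wins; a real merge could combine
        // non-conflicting changes from the concurrent versions instead
        return versions.stream()
                       .max(Comparator.comparingLong(Versioned::version))
                       .orElseThrow();
    }
}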
Specialized cache
* Again. Because we can

- Off-heap (Unsafe)
- Caches only the freshest chat page
- Saves its state to a local (AKA system) CF
  keys AND values
  sequential read, much faster startup

- In-memory compression
  2x more memory almost for free

#CASSANDRAEU
Disk mgmt
* 4U chassis, 24 HDDs, up to 4 TB/node

- Size-tiered compaction can produce a single 4 TB sstable file
- RAID10? LCS?

* Split each CF into 256 pieces (see the sketch below)
* The good

- Smaller, more frequent memtable flushes
- Same compaction work,
  in smaller sets

- Can distribute across disks
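One way the 256-piece split can work is to derive the piece from the row key, roughly as in this sketch (the naming scheme and hash are illustrative, not the actual implementation):

final class PiecedColumnFamily {
    private static final int PIECES = 256;

    static int pieceOf(byte[] rowKey) {
        int h = 17;
        for (byte b : rowKey) h = h * 31 + b;   // any stable hash works
        return Math.floorMod(h, PIECES);
    }

    static String pieceName(String baseCf, byte[] rowKey) {
        return baseCf + "_" + pieceOf(rowKey);  // e.g. "MessageCF_137"
    }
}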
#CASSANDRAEU
Disk Allocation Policies
* The default is

- “Take the disk with the most free space”

* So some disks end up with

- Too many read iops

* Generational policy (see the sketch below)

- Each disk holds the same number of same-generation files
  works better for HDDs
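A sketch of what a generational policy could look like, under the assumption that the allocator tracks per-directory file counts per SSTable generation:

import java.util.*;

final class GenerationalAllocator {
    private final List<String> dataDirs;
    // dataDir -> (generation -> file count)
    private final Map<String, Map<Integer, Integer>> filesPerGen = new HashMap<>();

    GenerationalAllocator(List<String> dataDirs) {
        this.dataDirs = dataDirs;
        for (String dir : dataDirs) filesPerGen.put(dir, new HashMap<>());
    }

    // pick the data directory with the fewest files of this generation,
    // so every disk ends up with roughly the same number per generation
    synchronized String allocate(int generation) {
        String best = dataDirs.get(0);
        int bestCount = Integer.MAX_VALUE;
        for (String dir : dataDirs) {
            int count = filesPerGen.get(dir).getOrDefault(generation, 0);
            if (count < bestCount) { best = dir; bestCount = count; }
        }
        filesPerGen.get(best).merge(generation, 1, Integer::sum);
        return best;
    }
}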

#CASSANDRAEU
Case #3. The ugly
feed my Frankenstein

#CASSANDRAEU
* Chats overview

- Small dataset (230 GB)
- Has a hot set, short tail (5%)
- The list reorders often
- 130k reads/s, 21k writes/s

#CASSANDRAEU
Conflicting updates
* List<Overview> is a single blob

  .. or you’ll have a lot of tombstones

* Lots of conflicts

  concurrent updates of a single column

* Needs conflict detection
* Has a merge algorithm
#CASSANDRAEU
Vector clocks
* Voldemort

- byte[] key -> byte[] value + vector clock (see the sketch below)
- Coordination logic on clients
- Pluggable storage engines

* Plugged in

- Cassandra 0.6 SSTable persistence
- Fronted by a specialized cache
  (we love caches)
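A minimal sketch of vector-clock comparison as used for conflict detection (generic, not Voldemort's actual implementation): each value carries a map of nodeId -> counter, and two clocks are concurrent when neither dominates the other.

import java.util.*;

final class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    VectorClock increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
        return this;
    }

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    Order compareTo(VectorClock other) {
        boolean less = false, greater = false;
        Set<String> nodes = new HashSet<>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String node : nodes) {
            long a = counters.getOrDefault(node, 0L);
            long b = other.counters.getOrDefault(node, 0L);
            if (a < b) less = true;
            if (a > b) greater = true;
        }
        if (less && greater) return Order.CONCURRENT;  // conflict: needs a merge
        if (less)            return Order.BEFORE;
        if (greater)         return Order.AFTER;
        return Order.EQUAL;
    }
}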

#CASSANDRAEU
Performance
* 3-node cluster, RF = 3

- Intel Xeon CPU E5506 @ 2.13 GHz
  RAM: 48 GB, 1x HDD, 1x SSD

* 8-byte key -> 1 KB value
* Results

- 75k reads/sec, 15k writes/sec

#CASSANDRAEU
Why Cassandra?
* Reusable distributed DB components

  fast persistence, gossip,
  reliable async messaging, failure detectors,
  topology, seq scans, ...

* Has structure

beyond byte[] key -> byte[] value

* Delivered promises
* Implemented in Java

#CASSANDRAEU
THANK YOU
Oleg Anastasyev
oa@odnoklassniki.ru
odnoklassniki.ru/oa
@m0nstermind

github.com/odnoklassniki
shared-memory-cache
  Java off-heap cache using shared memory

#CASSANDRAEU

one-nio
  RMI faster than java nio, with fast and
  compact automagic Java serialization

CASSANDRASUMMITEU
