
Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013


Odnoklassniki uses Cassandra for business data that doesn't fit into RAM. This data is typically fast-growing and frequently accessed by our users, and it must always be available, because it constitutes our primary business as a social network. The way we use Cassandra is somewhat unusual: we don't use Thrift or the Netty-based native protocol to communicate with Cassandra nodes remotely. Instead, we co-locate Cassandra nodes in the same JVM with the business service logic, exposing a business-level interface remotely rather than generic data manipulation. This way we avoid extra network round trips within a single business transaction, and we use internal calls to Cassandra classes to get information faster. It also lets us apply many small hacks to Cassandra's internals, yielding huge gains in efficiency and ease of distributed server development.


Being Closer to Cassandra

Oleg Anastasyev
Lead platform developer, Odnoklassniki.ru
Top 10 of the world's social networks*

- 40M DAU, 80M MAU, 7M peak
- ~300,000 www requests/sec, 20 ms render latency
- >240 Gbit/s outbound
- >5,800 physical servers in 5 DCs
- 99.9% Java

* Odnoklassniki means "classmates" in English
Cassandra @ Odnoklassniki

* Since 2010
- Branched off 0.6
- Aiming at: full operation on DC failure, scalability, ease of operations

* Now
- 23 clusters, 418 nodes in total
- 240 TB of stored data
- Survived several DC failures
Case #1. The fast
(Screenshot: the Like! widget -- "Like! 103 927", "You and 103 927")
Like! widget

* It's everywhere
- Dozens on every page
- On feeds (AKA timeline)
- On 3rd-party websites elsewhere on the internet

* It's on everything
- Pictures and albums
- Videos
- Posts and comments
- 3rd-party shared URLs
Like! widget

* High load
- 1,000,000 reads/sec, 3,000 writes/sec

* Hard load profile
- Read-mostly
- Long tail (40% of reads are random)
- Sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes on ~6 billion entities
Classic solution

SQL table: RefId:long, RefType:byte, UserId:long, Created
e.g. (9999999999, PICTURE(2), 11111111111, 11:00)

To render "You and 4256" for each widget:
- SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,?   -- >=1 per widget (98% are NONE)
- SELECT COUNT(*) WHERE RefId,RefType=?,?         -- M>N (80% are 0)
- SELECT TOP N * WHERE RefId,RefType=? AND IsFriend(?,UserId)  -- = N*140
Cassandra solution

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ((refType, refId), userId)
)

LikeCount (
  refType byte,
  refId bigint,
  likers counter,
  PRIMARY KEY ((refType, refId))
)

So, to render "You and 4256":
- SELECT FROM LikeCount WHERE refType,refId=?,?            (80% are 0)
- SELECT * FROM LikeByRef WHERE refType,refId,userId=?,?,? (98% are NONE)
= N*20%
  10. 10. >11 M iops * Quick workaround ? LikeByRef ( refType byte, refId bigint, userId bigint, PRIMARY KEY ( (RefType,RefId, UserId) ) SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId) - Forces Order Pres Partitioner (random not scales) - Key range scans - More network overhead - Partitions count >10x, Dataset size > x2 #CASSANDRAEU
By-column bloom filter

* What it does (a sketch follows this slide)
- Includes (PartKey, ColumnKey) pairs in the SSTable *-Filter.db

* The good
- Eliminated 98% of reads
- Fewer false positives

* The bad
- Filters become too large -> GC promotion failures
  ... but fixable (CASSANDRA-2466)
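The same trick can be sketched outside Cassandra's patched internals. Below is a minimal, hypothetical illustration using Guava's BloomFilter, keyed on (partition key, column key) pairs so a point read for a pair that was never written can skip the SSTable; the class name and key encoding are assumptions, not the actual patch:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    // Per-SSTable filter over (partition key, column key) pairs: a point
    // read for a (ref, user) pair that was never written can skip this
    // SSTable entirely. Names and encoding are illustrative only.
    public class ByColumnBloomFilter {
        private final BloomFilter<byte[]> filter;

        public ByColumnBloomFilter(int expectedPairs) {
            // 1% false positives. The deck notes real filters grew so large
            // they caused GC promotion failures (addressed via CASSANDRA-2466).
            filter = BloomFilter.create(Funnels.byteArrayFunnel(), expectedPairs, 0.01);
        }

        private static byte[] pair(String partKey, String colKey) {
            return (partKey + '\0' + colKey).getBytes(StandardCharsets.UTF_8);
        }

        public void add(String partKey, String colKey) {
            filter.put(pair(partKey, colKey));
        }

        // false => the column definitely does not exist in this SSTable
        public boolean mightContain(String partKey, String colKey) {
            return filter.mightContain(pair(partKey, colKey));
        }
    }
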
Are we there yet?

(Diagram: the application server calls Cassandra twice per widget --
1. COUNT(), 2. EXISTS())

- Minimum 2 round trips per render (COUNT + RR)
- Thrift is slow, especially with a lot of connections
- EXISTS() alone is 200 Gbit/sec (140 * 8 * 1Mps * 20%)
Co-locate!

(Diagram: the odnoklassniki-like service exposes a remote business
interface -- get(): LikeSummary -- and hosts the counters cache, the
social graph cache and the Cassandra node in a single JVM)

- one-nio remoting (faster than java nio)
- Topology-aware clients
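For illustration, a hypothetical sketch of what such a business-level remote surface could look like. The deck only names get(): LikeSummary; the fields and types here are assumptions:

    // Hypothetical business-level remote interface; one call replaces the
    // COUNT + EXISTS round trips because it is answered in the same JVM
    // as the Cassandra node.
    public interface LikeService {
        LikeSummary get(byte refType, long refId, long viewerId);

        final class LikeSummary {
            public final long count;          // from the counters cache
            public final boolean viewerLikes; // bloom filter + local read
            public LikeSummary(long count, boolean viewerLikes) {
                this.count = count;
                this.viewerLikes = viewerLikes;
            }
        }
    }
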
Co-location wins

* Fast TOP N friend-likers query (see the sketch below)
1. Take friends from the graph cache
2. Check them against an in-memory bloom filter
3. Read some until N friends are found

* Custom caches
- Tuned for the application

* Custom data merge logic
- ... so you can detect and resolve conflicts
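A sketch of that three-step flow in plain Java. GraphCache, LikesBloomFilter and LikeStore are hypothetical stand-ins for the co-located caches and the local Cassandra read path; only the algorithm itself comes from the slide. The point of co-location is that all three steps run in-process, so no network round trip is paid until a candidate has already passed the bloom filter:

    import java.util.ArrayList;
    import java.util.List;

    public class FriendLikers {
        interface GraphCache { List<Long> friendsOf(long userId); }
        interface LikesBloomFilter { boolean mightLike(long refId, long userId); }
        interface LikeStore { boolean hasLike(long refId, long userId); }

        private final GraphCache graph;
        private final LikesBloomFilter bloom;
        private final LikeStore store;

        FriendLikers(GraphCache g, LikesBloomFilter b, LikeStore s) {
            this.graph = g; this.bloom = b; this.store = s;
        }

        List<Long> topFriendLikers(long userId, long refId, int n) {
            List<Long> found = new ArrayList<>(n);
            for (long friend : graph.friendsOf(userId)) {      // 1. graph cache
                if (!bloom.mightLike(refId, friend)) continue; // 2. cheap negative check
                if (store.hasLike(refId, friend)) {            // 3. confirm with a local read
                    found.add(friend);
                    if (found.size() == n) break;              // read some until N found
                }
            }
            return found;
        }
    }
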
Listen for mutations

// Implement it
interface StoreApplyListener {
    boolean preapply(String key,
                     ColumnFamily data);
}

// ... and register it with the CFS
store = Table.open(..)
    .getColumnFamilyStore(..);
store.setListener(myListener);

* Register it between commit log replay and gossip start
* Hooked into RowMutation.apply(), so it sees the original mutation
  plus replica writes, hints and read repairs
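A hypothetical use of that hook: keeping a counters cache in sync by observing every mutation. The stubs below stand in for the patched Cassandra types so the sketch is self-contained:

    // StoreApplyListener and ColumnFamily mirror the types from the slide;
    // CountersCache is a hypothetical stand-in for the off-heap cache.
    interface ColumnFamily { int getColumnCount(); }
    interface StoreApplyListener { boolean preapply(String key, ColumnFamily data); }
    interface CountersCache { void addLikes(String key, int delta); }

    class LikeCountListener implements StoreApplyListener {
        private final CountersCache cache;

        LikeCountListener(CountersCache cache) { this.cache = cache; }

        @Override
        public boolean preapply(String key, ColumnFamily data) {
            // Runs before each mutation is applied -- replica writes, hint
            // replays and read repairs included, which is what keeps
            // replica caches warm.
            cache.addLikes(key, data.getColumnCount());
            return true; // returning false would turn the mutation into a NOP
        }
    }
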
Like! optimized counters

* Counters cache (see the sketch below)
- Off-heap (sun.misc.Unsafe)
- Compact (30M counters in 1 GB of RAM)
- Reads are served from the local node's cache only
- Long-tail aware

* Replicated cache state
- Solves the cold-replica-cache problem: (NOP) mutations keep replica
  caches warm, so fewer reads

LikeCount (
  refType byte,
  refId bigint,
  ip inet,
  counter int,
  PRIMARY KEY ((refType, refId), ip)
)
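A minimal sketch of an off-heap counter slab via sun.misc.Unsafe, in the spirit of "30M in 1G RAM". The 16-byte slot layout, the power-of-two table size and the missing collision handling are all simplifying assumptions:

    import sun.misc.Unsafe;
    import java.lang.reflect.Field;

    public class OffHeapCounters {
        private static final Unsafe UNSAFE = unsafe();
        private final long base;  // address of the off-heap slab
        private final int slots;  // must be a power of two

        public OffHeapCounters(int slots) {
            this.slots = slots;
            // 16 bytes per slot: 8-byte key hash + 8-byte counter
            this.base = UNSAFE.allocateMemory(16L * slots);
            UNSAFE.setMemory(base, 16L * slots, (byte) 0);
        }

        public void add(long keyHash, long delta) {
            long addr = slot(keyHash);
            UNSAFE.putLong(addr, keyHash); // no collision handling in this sketch
            UNSAFE.putLong(addr + 8, UNSAFE.getLong(addr + 8) + delta);
        }

        public long get(long keyHash) {
            long addr = slot(keyHash);
            return UNSAFE.getLong(addr) == keyHash ? UNSAFE.getLong(addr + 8) : 0;
        }

        private long slot(long keyHash) {
            return base + 16L * (int) (keyHash & (slots - 1));
        }

        public void free() { UNSAFE.freeMemory(base); }

        private static Unsafe unsafe() {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new AssertionError(e);
            }
        }
    }

Off-heap storage keeps these 30M entries invisible to the garbage collector, which is the whole point: no GC pressure from the cache.
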
Read latency variations

* Cassandra's read behavior
1. Choose 1 node for data and N for digests
2. Wait for the data and the digests
3. Compare and return (or read-repair)

* Nodes suddenly slow down
- SEDA hiccups, commit log rotation, sudden IO saturation,
  network hiccups or partitions, page cache misses

* The bad
- You get latency spikes
- You have to wait (and time out)
Read latency leveling

* "Parallel" read handler (sketched below)
1. Ask all replicas for data in parallel
2. Wait for CL responses and return

* The good
- Minimal-latency response
- Constant load when a DC fails

* The (not so) bad
- "Additional" work and traffic
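A sketch of the parallel read idea with standard java.util.concurrent tools. Replica is a hypothetical stand-in for the messaging layer, and digest comparison / read repair are omitted:

    import java.util.List;
    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Send the full data read to every replica and return once CL
    // responses arrive, instead of one data read + N digest reads.
    public class ParallelReader {
        interface Replica { byte[] read(String key); }

        private final ExecutorService pool = Executors.newCachedThreadPool();

        byte[] read(String key, List<Replica> replicas, int consistencyLevel)
                throws InterruptedException, ExecutionException {
            CompletionService<byte[]> done = new ExecutorCompletionService<>(pool);
            for (Replica r : replicas) {
                done.submit(() -> r.read(key));  // 1. ask all replicas in parallel
            }
            byte[] data = null;
            for (int i = 0; i < consistencyLevel; i++) {
                data = done.take().get();        // 2. wait for CL responses
            }
            // Latency tracks the CL-th fastest replica, not the slowest one,
            // so a single slow node no longer causes a spike.
            return data;
        }
    }
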
More tiny tricks

* On SSD IO
- Deadline IO elevator
- 64k -> 4k read request size

* HintLog
- A commit log for hints
- Wait for all hints on startup

* Selective compaction
- Compacts the most-read CFs more often
Case #2. The fat
* Messages in chats
- The last page is accessed on open
- Long tail (80%) for the rest
- 150 billion messages, 100 TB in storage
- Read-mostly (120k reads/sec, 8k writes/sec)
Messages have structure

Message (               -- the logical structure
  chatId, msgId,
  created, type, userIndex, deletedBy, ...,
  text
)

MessageCF (             -- what is actually stored
  chatId, msgId,
  data blob,
  PRIMARY KEY (chatId, msgId)
)

- All of a chat's messages live in a single partition
- A single blob per message to reduce overhead
- The bad: conflicting modifications can happen (users, anti-spam, etc.)
LW conflict resolution

Messages (
  chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY (chatId, msgId, version)
)

Two writers race on the same message:

1. Both get (version: ts1, data: d1)
2. Writer A: write(ts1, data2) => delete(version: ts1); insert(version: ts2 = now(), data2)
3. Writer B: write(ts1, data3) => delete(version: ts1); insert(version: ts3 = now(), data3)
4. Both (ts2, data2) and (ts3, data3) survive -- merged on read
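The same protocol can be modeled in a few lines. This sketch keeps versions in an in-memory map standing in for the (chatId, msgId, version) rows; the merge is a placeholder, since the deck only says conflicts are merged on read:

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Versions of one (chatId, msgId): timestamp -> blob, standing in for
    // rows keyed by PRIMARY KEY (chatId, msgId, version).
    public class VersionedMessage {
        private final NavigableMap<Long, byte[]> versions = new ConcurrentSkipListMap<>();

        // Update: delete the version that was read, insert a fresh one.
        public void write(long readVersion, byte[] data) {
            versions.remove(readVersion);                   // delete(version: ts1)
            versions.put(System.currentTimeMillis(), data); // insert(version: now(), data)
        }

        // Read: if writers raced, several versions survive and get merged.
        public byte[] read() {
            if (versions.size() > 1) {
                return mergeConflict(versions);             // e.g. (ts2,d2) and (ts3,d3)
            }
            Map.Entry<Long, byte[]> last = versions.lastEntry();
            return last == null ? null : last.getValue();
        }

        private byte[] mergeConflict(NavigableMap<Long, byte[]> all) {
            // Placeholder: a real merge is message-structure aware
            // (type, userIndex, deletedBy, text, ...).
            return all.lastEntry().getValue();
        }
    }
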
Specialized cache

* Again. Because we can
- Off-heap (Unsafe)
- Caches only the freshest chat page
- Saves its state (keys AND values) to a local (AKA system) CF:
  sequential read, much faster startup
- In-memory compression: 2x more memory, almost free (see the sketch below)
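A sketch of that in-memory compression trade-off using the JDK's Deflater/Inflater; the compression level and framing are assumptions:

    import java.util.Arrays;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    // Pages are deflated before entering the cache and inflated on hit,
    // roughly doubling effective capacity for a little CPU.
    public final class PageCompressor {
        public static byte[] compress(byte[] page) {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setInput(page);
            deflater.finish();
            byte[] buf = new byte[page.length + 64]; // incompressible input grows slightly
            int n = deflater.deflate(buf);
            deflater.end();
            return Arrays.copyOf(buf, n);
        }

        // originalLength must be stored alongside the compressed blob
        public static byte[] decompress(byte[] blob, int originalLength)
                throws DataFormatException {
            Inflater inflater = new Inflater();
            inflater.setInput(blob);
            byte[] page = new byte[originalLength];
            inflater.inflate(page);
            inflater.end();
            return page;
        }
    }
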
Disk management

* 4U boxes with 24 HDDs, up to 4 TB/node
- Size-tiered compaction => a 4 TB SSTable file
- RAID10? LCS?

* Split each CF into 256 pieces (sketched below)

* The good
- Smaller, more frequent memtable flushes
- The same compaction work, in smaller sets
- Pieces can be distributed across disks
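A sketch of the key-to-piece routing; the "<cf>_<n>" naming scheme is an assumption:

    import java.util.Arrays;

    // Route each partition key to one of 256 physical column families, so
    // every piece flushes and compacts independently and pieces can be
    // placed on different disks.
    public final class SplitColumnFamily {
        static String pieceName(String baseCf, byte[] partitionKey) {
            int piece = (Arrays.hashCode(partitionKey) & 0x7fffffff) % 256;
            return baseCf + "_" + piece; // e.g. "Messages_137"
        }
    }
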
Disk allocation policies

* The default is
- "Take the disk with the most free space"

* So some disks get
- Too many read iops

* Generational policy (sketched below)
- Each disk holds the same number of same-generation files
- Works better for HDDs
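A sketch of such a generational policy; the Disk interface is hypothetical:

    import java.util.Comparator;
    import java.util.List;

    // Pick the disk holding the fewest files of the given compaction
    // generation, so same-generation SSTables spread evenly over disks
    // and no single spindle collects all the read-hot files.
    public final class GenerationalAllocator {
        interface Disk {
            int filesOfGeneration(int generation);
            String path();
        }

        static Disk choose(List<Disk> disks, int generation) {
            return disks.stream()
                    .min(Comparator.comparingInt((Disk d) -> d.filesOfGeneration(generation)))
                    .orElseThrow(IllegalStateException::new);
        }
    }
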
Case #3. The ugly (feed my Frankenstein)
* Chats overview
- Small dataset (230 GB)
- Has a hot set; short tail (5%)
- The list reorders often
- 130k reads/sec, 21k writes/sec
Conflicting updates

* List<Overview> is a single blob
- ... or you'll have a lot of tombstones

* Lots of conflicts
- Updates of a single column

* Need conflict detection
* Have a merge algorithm
Vector clocks

* Voldemort
- byte[] key -> byte[] value + vector clock
- Coordination logic on clients
- Pluggable storage engines

* Plugged in
- Cassandra 0.6 SSTable persistence
- Fronted by a specialized cache (we love caches)
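For reference, a minimal vector clock of the kind Voldemort uses: one counter per writing node, enough to detect the concurrent updates that the merge algorithm then resolves:

    import java.util.HashMap;
    import java.util.Map;

    public class VectorClock {
        private final Map<String, Long> counters = new HashMap<>();

        // Bump this node's counter on every local write.
        public void increment(String nodeId) {
            counters.merge(nodeId, 1L, Long::sum);
        }

        // true if this clock is >= the other on every node's counter.
        public boolean dominates(VectorClock other) {
            for (Map.Entry<String, Long> e : other.counters.entrySet()) {
                if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
            }
            return true;
        }

        // Neither clock dominates => concurrent updates => app-level merge.
        public boolean concurrentWith(VectorClock other) {
            return !dominates(other) && !other.dominates(this);
        }
    }
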
Performance

* 3-node cluster, RF = 3
- Intel Xeon E5506 @ 2.13 GHz, 48 GB RAM, 1x HDD, 1x SSD

* 8-byte key -> 1 KB value

* Results
- 75k reads/sec, 15k writes/sec
Why Cassandra?

* Reusable distributed DB components
- Fast persistence, gossip, reliable async messaging, failure detectors,
  topology, sequential scans, ...

* Has structure beyond byte[] key -> byte[] value
* Delivered on its promises
* Implemented in Java
THANK YOU

Oleg Anastasyev
oa@odnoklassniki.ru
odnoklassniki.ru/oa
@m0nstermind

github.com/odnoklassniki
- shared-memory-cache: Java off-heap cache using shared memory
- one-nio: RMI faster than Java NIO, with fast and compact automagic Java serialization
