SlideShare a Scribd company logo
1 of 33
Download to read offline
Being Closer to Cassandra

Oleg Anastasyev
lead platform developer
Odnoklassniki.ru
Top 10 of World’s social networks
40M DAU, 80M MAU, 7M peak
~ 300 000 www req/sec,
20 ms render latency
>240 Gbit out
> 5 800 iron servers in 5 DCs
99.9% java

#CASSANDRAEU

* Odnoklassniki means “classmates” in english
Cassandra @
* Since 2010

- branched 0.6
- aiming at:
full operation on DC failure, scalability, ease of
operations

* Now

- 23 clusters
- 418 nodes in total
- 240 TB of stored data
- survived several DC failures

#CASSANDRAEU
Case #1. The fast

#CASSANDRAEU
Like! 103 927

#CASSANDRAEU

You and 103 927
Like! widget
* Its everywhere

- Have it on every page, dozen
- On feeds (AKA timeline)
- 3rd party websites elsewhere on internet

* Its on everything

- Pictures and Albums
- Videos
- Posts and comments
- 3rd party shared URLs

#CASSANDRAEU

Like! 103 927
Like! widget
* High load

- 1 000 000 reads/sec, 3 000 writes/sec

Like! 103 927
Hard load profile
*
- Read most
- Long tail (40% of reads are random)
- Sensitive to latency variations
- 3TB total dataset (9TB with RF) and growing
- ~ 60 billion likes for ~6bi entities

#CASSANDRAEU
Classic solution
SQL table
RefId:long

RefType:byte

UserId:long

Created

9999999999

PICTURE(2)

11111111111

11:00

to render

You and 4256

SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,?

= N >=1

(98% are NONE)
SELECT COUNT (*) WHERE RefId,RefType=?,?

= M>N

(80% are 0)
SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)
#CASSANDRAEU

= N*140
Cassandra solution
LikeByRef (
refType byte,
refId bigint,
userId bigint,

LikeCount (
refType byte,
refId bigint,
likers counter,

PRIMARY KEY ( (RefType,RefId), UserId)

so, to render

PRIMARY KEY ( (RefType,RefId))

You and 4256

SELECT FROM LikeCount WHERE RefId,RefType=?,?
(80% are 0)
SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,?
(98% are NONE)

#CASSANDRAEU

= N*20%
>11 M iops
* Quick workaround ?
LikeByRef (
refType byte,
refId bigint,
userId bigint,

PRIMARY KEY ( (RefType,RefId, UserId) )
SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)

- Forces Order Pres Partitioner
(random not scales)

- Key range scans
- More network overhead
- Partitions count >10x, Dataset size > x2
#CASSANDRAEU
By column bloom filter
* What is does

- Includes pairs of (PartKey, ColumnKey) in
SSTable *-Filter.db

* The good

- Eliminated 98 % of reads
- Less false positives
* The bad
- They become too large
GC Promotion Failures
.. but fixable (CASSANDRA-2466)

#CASSANDRAEU
Are we there yet ?
1. COUNT()

application server
> 400

00

2. EXISTS
cassandra

- min 2 roundtrips per render (COUNT+RR)
- THRIFT is slow, esp having lot of connections
- EXISTS() is 200 Gbit/sec (140*8*1Mps*20%)
#CASSANDRAEU
Co-locate!
get() : LikeSummary

odnoklassniki-like
Remote Business Intf

Counters Cache
cassandra
Social Graph Cache

- one-nio remoting (faster than java nio)
- topology aware clients
#CASSANDRAEU
co-location wins
* Fast TOP N friend likers query

1. Take friends from graph cache
2. Check it with memory bloom filter
3. Read some until N friends found

* Custom caches

- Tuned for application
* Custom data merge logic
- ... so you can detect and resolve conflicts
#CASSANDRAEU
Listen for mutations
// Implement it
interface StoreApplyListener {
boolean preapply(String key,
ColumnFamily data);
}

// and register with CFS
store=Table.open(..)
.getColumnFamilyStore(..);
store.setListener(myListener);

* Register it

between commit logs replay and gossip

* RowMutation.apply()

extend original mutation
+ Replica, hints, ReadRepairs

#CASSANDRAEU
Like! optimized counters
* Counters cache

- Off heap (sun.misc.Unsafe)
- Compact (30M in 1G RAM)
- Read cached local node only

* Replicated cache state
-

#CASSANDRAEU

cold replica cache problem
making (NOP) mutations
less reads
long tail aware

LikeCount (
refType byte,
refId bigint,
ip inet,
counter int
PRIMARY KEY ( (RefType,RefId), ip)
Read latency variations
* CS read behavior

1. Choose 1 node for data and N for digest
2. Wait for data and digest
3. Compare and return (or RR)

* Nodes suddenly slowdown

- SEDA hiccup, commit log rotation, sudden IO

saturation, Network hiccup or partition, page
cache miss

* The bad

- You have spikes.
- You have to wait (and timeout)

#CASSANDRAEU
Read Latency leveling
* “Parallel” read handler

1. Ask all replicas for data in parallel
2. Wait for CL responses and return

* The good

- Minimal latency response
- Constant load when DC fails

* The (not so) bad

- “Additional” work and traffic

#CASSANDRAEU
More tiny tricks
* On SSD io

- Deadline IO elevator
- 64k -> 4k read request size

* HintLog

- Commit log for hints
- Wait for all hints on startup

* Selective compaction

- Compacts most read CFs more often

#CASSANDRAEU
Case #2. The fat

#CASSANDRAEU
* Messages in chats

- Last page is accessed on open
- long tail (80%) for rest

- 150 billion, 100 TB in storage
- Read most (120k reads/sec, 8k writes/sec)
#CASSANDRAEU
Messages have structure
Message (
chatId, msgId,

MessageCF (
chatId, msgId,

created, type,userIndex,deletedBy,...
text
)

data blob,
PRIMARY KEY ( chatId, msgId )

- All chat’s messages in single partition
- Single blob for message data
to reduce overhead

- The bad
Conflicting modifications can happen
(users, anti-spam, etc..)

#CASSANDRAEU
LW conflict resolution
get
get

(version:ts1, data:d1)

(version:ts1, data:d1)
write( ts1, data2 )
delete(version:ts1)
insert(version: ts2=now(), data2)
Messages (
chatId, msgId,
version timestamp,
data blob
PRIMARY KEY ( chatId, msgId, version )

write( ts1, data3 )
delete(version:ts1)
insert(version: ts3=now(), data3)
(ts2, data2)
(ts3, data3)

#CASSANDRAEU

- merged on read
Specialized cache
* Again. Because we can

- Off-heap (Unsafe)
- Caches only freshest chat page
- Saves its state to local (AKA system) CF
keys AND values
seq read, much faster startup

- In memory compression
2x more memory almost free

#CASSANDRAEU
Disk mgmt
* 4U HDDx24, up to 4TB/node

- Size tiered compaction = 4 TB sstable file
- RAID10 ? LCS ?

* Split CF to 256 pieces
* The good

- Smaller, more frequent memtable flushes
- Same compaction work
in smaller sets

- Can distribute across disks
#CASSANDRAEU
Disk Allocation Policies
* Default is

- “Take disk with most free space”
* Some disks have
- Too much read iops

* Generational policy

- Each disk has same # of same gen files
work better for HDD

#CASSANDRAEU
Case #3. The ugly
feed my Frankenstein

#CASSANDRAEU
* Chats overview

- small dataset (230GB)
- has hot set, short tail (5%)
- list reorders often
- 130k read/s, 21k write/s

#CASSANDRAEU
Conflicting updates
* List<Overview> is single blob

.. or you’ll have a lot of tombstones

* Lot of conflicts

updates of single column

* Need conflict detection
* Has merge algoritm
#CASSANDRAEU
Vector clocks
* Voldemort

- byte[] key -> byte[] value + VC
- Coordination logic on clients
- Pluggable storage engines

* Plugged

- CS 0.6 SSTables persistance
- Fronted by specialized cache
we love caches

#CASSANDRAEU
Performance
* 3 node cluster, RF = 3

- Intel Xeon CPU E5506 2.13GHz
RAM: 48Gb, 1x HDD, 1x SSD

* 8 byte key -> 1 KB byte value
* Results

- 75 k /sec reads, 15 k/ sec writes

#CASSANDRAEU
Why cassandra ?
* Reusable distributed DB components
fast persistance, gossip,
Reliable Async Messaging, Fail detectors,
Topology, Seq scans, ...

* Has structure

beyond byte[] key -> byte[] value

* Delivered promises
* Implemented in Java

#CASSANDRAEU
THANK YOU
Oleg Anastasyev
oa@odnoklassniki.ru
odnoklassniki.ru/oa
@m0nstermind

github.com/odnoklassniki
shared-memory-cache
java Off-Heap cache using shared
memory

#CASSANDRAEU

one-nio
rmi faster than java nio with fast and
compact automagic java serialization

CASSANDRASUMMITEU

More Related Content

What's hot

REDIS intro and how to use redis
REDIS intro and how to use redisREDIS intro and how to use redis
REDIS intro and how to use redisKris Jeong
 
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...Ontico
 
HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)akirahiguchi
 
Мастер-класс "Логическая репликация и Avito" / Константин Евтеев, Михаил Тюр...
Мастер-класс "Логическая репликация и Avito" / Константин Евтеев,  Михаил Тюр...Мастер-класс "Логическая репликация и Avito" / Константин Евтеев,  Михаил Тюр...
Мастер-класс "Логическая репликация и Avito" / Константин Евтеев, Михаил Тюр...Ontico
 
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)Ontico
 
MySQL async message subscription platform
MySQL async message subscription platformMySQL async message subscription platform
MySQL async message subscription platformLouis liu
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsServer Density
 
Redis SoCraTes 2014
Redis SoCraTes 2014Redis SoCraTes 2014
Redis SoCraTes 2014steffenbauer
 
Nvmfs benchmark
Nvmfs benchmarkNvmfs benchmark
Nvmfs benchmarkLouis liu
 
Percona Toolkit for Effective MySQL Administration
Percona Toolkit for Effective MySQL AdministrationPercona Toolkit for Effective MySQL Administration
Percona Toolkit for Effective MySQL AdministrationMydbops
 
Базы данных. HDFS
Базы данных. HDFSБазы данных. HDFS
Базы данных. HDFSVadim Tsesko
 
MySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the PacemakerMySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the Pacemakerhastexo
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXzznate
 
glance replicator
glance replicatorglance replicator
glance replicatoririx_jp
 
HandlerSocket - A NoSQL plugin for MySQL
HandlerSocket - A NoSQL plugin for MySQLHandlerSocket - A NoSQL plugin for MySQL
HandlerSocket - A NoSQL plugin for MySQLJui-Nan Lin
 
Introduction to XtraDB Cluster
Introduction to XtraDB ClusterIntroduction to XtraDB Cluster
Introduction to XtraDB Clusteryoku0825
 
Oracle cluster installation with grid and iscsi
Oracle cluster  installation with grid and iscsiOracle cluster  installation with grid and iscsi
Oracle cluster installation with grid and iscsiChanaka Lasantha
 
MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017Antonios Giannopoulos
 

What's hot (20)

REDIS intro and how to use redis
REDIS intro and how to use redisREDIS intro and how to use redis
REDIS intro and how to use redis
 
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
 
HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)HandlerSocket plugin for MySQL (English)
HandlerSocket plugin for MySQL (English)
 
Мастер-класс "Логическая репликация и Avito" / Константин Евтеев, Михаил Тюр...
Мастер-класс "Логическая репликация и Avito" / Константин Евтеев,  Михаил Тюр...Мастер-класс "Логическая репликация и Avito" / Константин Евтеев,  Михаил Тюр...
Мастер-класс "Логическая репликация и Avito" / Константин Евтеев, Михаил Тюр...
 
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
ToroDB: scaling PostgreSQL like MongoDB / Álvaro Hernández Tortosa (8Kdata)
 
MySQL async message subscription platform
MySQL async message subscription platformMySQL async message subscription platform
MySQL async message subscription platform
 
Cassandra勉強会
Cassandra勉強会Cassandra勉強会
Cassandra勉強会
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & Analytics
 
Redis SoCraTes 2014
Redis SoCraTes 2014Redis SoCraTes 2014
Redis SoCraTes 2014
 
Nvmfs benchmark
Nvmfs benchmarkNvmfs benchmark
Nvmfs benchmark
 
Percona Toolkit for Effective MySQL Administration
Percona Toolkit for Effective MySQL AdministrationPercona Toolkit for Effective MySQL Administration
Percona Toolkit for Effective MySQL Administration
 
Базы данных. HDFS
Базы данных. HDFSБазы данных. HDFS
Базы данных. HDFS
 
MySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the PacemakerMySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the Pacemaker
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMX
 
glance replicator
glance replicatorglance replicator
glance replicator
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
HandlerSocket - A NoSQL plugin for MySQL
HandlerSocket - A NoSQL plugin for MySQLHandlerSocket - A NoSQL plugin for MySQL
HandlerSocket - A NoSQL plugin for MySQL
 
Introduction to XtraDB Cluster
Introduction to XtraDB ClusterIntroduction to XtraDB Cluster
Introduction to XtraDB Cluster
 
Oracle cluster installation with grid and iscsi
Oracle cluster  installation with grid and iscsiOracle cluster  installation with grid and iscsi
Oracle cluster installation with grid and iscsi
 
MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017
 

Similar to Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013

Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)Richard Low
 
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark DataStax Academy
 
Best Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisBest Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisJignesh Shah
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Clusterpercona2013
 
Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDaehyeok Kim
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandrazznate
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptxJawaharPrasad3
 
Build an affordable Cloud Stroage
Build an affordable Cloud StroageBuild an affordable Cloud Stroage
Build an affordable Cloud StroageAlex Lau
 
Zing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value DatabaseZing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value Databasezingopen
 
Zing Database
Zing Database Zing Database
Zing Database Long Dao
 
Percona live linux filesystems and my sql
Percona live   linux filesystems and my sqlPercona live   linux filesystems and my sql
Percona live linux filesystems and my sqlMichael Zhang
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Clusterpercona2013
 
C* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonC* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonDataStax Academy
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016Tomas Vondra
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisMike Pittaro
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data Analysisodsc
 

Similar to Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013 (20)

Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)
 
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
 
4 use cases for C* to Scylla
4 use cases for C*  to Scylla4 use cases for C*  to Scylla
4 use cases for C* to Scylla
 
Best Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisBest Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on Solaris
 
Real world capacity
Real world capacityReal world capacity
Real world capacity
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
 
Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed Systems
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
Build an affordable Cloud Stroage
Build an affordable Cloud StroageBuild an affordable Cloud Stroage
Build an affordable Cloud Stroage
 
Zing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value DatabaseZing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value Database
 
Zing Database
Zing Database Zing Database
Zing Database
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
Percona live linux filesystems and my sql
Percona live   linux filesystems and my sqlPercona live   linux filesystems and my sql
Percona live linux filesystems and my sql
 
Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
 
C* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonC* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick Branson
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
 
IO Dubi Lebel
IO Dubi LebelIO Dubi Lebel
IO Dubi Lebel
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data Analysis
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data Analysis
 

More from odnoklassniki.ru

Тестирование аварий. Андрей Губа. Highload++ 2015
Тестирование аварий. Андрей Губа. Highload++ 2015Тестирование аварий. Андрей Губа. Highload++ 2015
Тестирование аварий. Андрей Губа. Highload++ 2015odnoklassniki.ru
 
Тюним память и сетевой стек в Linux: история перевода высоконагруженных серве...
Тюним память и сетевой стек в Linux: история перевода высоконагруженных серве...Тюним память и сетевой стек в Linux: история перевода высоконагруженных серве...
Тюним память и сетевой стек в Linux: история перевода высоконагруженных серве...odnoklassniki.ru
 
Распределенные системы в Одноклассниках
Распределенные системы в ОдноклассникахРаспределенные системы в Одноклассниках
Распределенные системы в Одноклассникахodnoklassniki.ru
 
Кадры решают все, или стриминг видео в «Одноклассниках». Александр Тоболь
Кадры решают все, или стриминг видео в «Одноклассниках». Александр ТобольКадры решают все, или стриминг видео в «Одноклассниках». Александр Тоболь
Кадры решают все, или стриминг видео в «Одноклассниках». Александр Тобольodnoklassniki.ru
 
За гранью NoSQL: NewSQL на Cassandra
За гранью NoSQL: NewSQL на CassandraЗа гранью NoSQL: NewSQL на Cassandra
За гранью NoSQL: NewSQL на Cassandraodnoklassniki.ru
 
Платформа для видео сроком в квартал. Александр Тоболь.
Платформа для видео сроком в квартал. Александр Тоболь.Платформа для видео сроком в квартал. Александр Тоболь.
Платформа для видео сроком в квартал. Александр Тоболь.odnoklassniki.ru
 
Франкенштейнизация Voldemort или key-value данные в Одноклассниках. Роман Ан...
Франкенштейнизация Voldemort или key-value данные в Одноклассниках. Роман Ан...Франкенштейнизация Voldemort или key-value данные в Одноклассниках. Роман Ан...
Франкенштейнизация Voldemort или key-value данные в Одноклассниках. Роман Ан...odnoklassniki.ru
 
Аварийный дамп – чёрный ящик упавшей JVM. Андрей Паньгин
Аварийный дамп – чёрный ящик упавшей JVM. Андрей ПаньгинАварийный дамп – чёрный ящик упавшей JVM. Андрей Паньгин
Аварийный дамп – чёрный ящик упавшей JVM. Андрей Паньгинodnoklassniki.ru
 
Управление тысячами серверов в Одноклассниках. Алексей Чудов.
Управление тысячами серверов в Одноклассниках. Алексей Чудов.Управление тысячами серверов в Одноклассниках. Алексей Чудов.
Управление тысячами серверов в Одноклассниках. Алексей Чудов.odnoklassniki.ru
 
Класс!ная Cassandra
Класс!ная CassandraКласс!ная Cassandra
Класс!ная Cassandraodnoklassniki.ru
 
Незаурядная Java как инструмент разработки высоконагруженного сервера
Незаурядная Java как инструмент разработки высоконагруженного сервераНезаурядная Java как инструмент разработки высоконагруженного сервера
Незаурядная Java как инструмент разработки высоконагруженного сервераodnoklassniki.ru
 
Cистема внутренней статистики Odnoklassniki.ru
Cистема внутренней статистики Odnoklassniki.ruCистема внутренней статистики Odnoklassniki.ru
Cистема внутренней статистики Odnoklassniki.ruodnoklassniki.ru
 
Как, используя Lucene, построить высоконагруженную систему поиска разнородных...
Как, используя Lucene, построить высоконагруженную систему поиска разнородных...Как, используя Lucene, построить высоконагруженную систему поиска разнородных...
Как, используя Lucene, построить высоконагруженную систему поиска разнородных...odnoklassniki.ru
 

More from odnoklassniki.ru (13)

Тестирование аварий. Андрей Губа. Highload++ 2015
Тестирование аварий. Андрей Губа. Highload++ 2015Тестирование аварий. Андрей Губа. Highload++ 2015
Тестирование аварий. Андрей Губа. Highload++ 2015
 
Тюним память и сетевой стек в Linux: история перевода высоконагруженных серве...
Тюним память и сетевой стек в Linux: история перевода высоконагруженных серве...Тюним память и сетевой стек в Linux: история перевода высоконагруженных серве...
Тюним память и сетевой стек в Linux: история перевода высоконагруженных серве...
 
Распределенные системы в Одноклассниках
Распределенные системы в ОдноклассникахРаспределенные системы в Одноклассниках
Распределенные системы в Одноклассниках
 
Кадры решают все, или стриминг видео в «Одноклассниках». Александр Тоболь
Кадры решают все, или стриминг видео в «Одноклассниках». Александр ТобольКадры решают все, или стриминг видео в «Одноклассниках». Александр Тоболь
Кадры решают все, или стриминг видео в «Одноклассниках». Александр Тоболь
 
За гранью NoSQL: NewSQL на Cassandra
За гранью NoSQL: NewSQL на CassandraЗа гранью NoSQL: NewSQL на Cassandra
За гранью NoSQL: NewSQL на Cassandra
 
Платформа для видео сроком в квартал. Александр Тоболь.
Платформа для видео сроком в квартал. Александр Тоболь.Платформа для видео сроком в квартал. Александр Тоболь.
Платформа для видео сроком в квартал. Александр Тоболь.
 
Франкенштейнизация Voldemort или key-value данные в Одноклассниках. Роман Ан...
Франкенштейнизация Voldemort или key-value данные в Одноклассниках. Роман Ан...Франкенштейнизация Voldemort или key-value данные в Одноклассниках. Роман Ан...
Франкенштейнизация Voldemort или key-value данные в Одноклассниках. Роман Ан...
 
Аварийный дамп – чёрный ящик упавшей JVM. Андрей Паньгин
Аварийный дамп – чёрный ящик упавшей JVM. Андрей ПаньгинАварийный дамп – чёрный ящик упавшей JVM. Андрей Паньгин
Аварийный дамп – чёрный ящик упавшей JVM. Андрей Паньгин
 
Управление тысячами серверов в Одноклассниках. Алексей Чудов.
Управление тысячами серверов в Одноклассниках. Алексей Чудов.Управление тысячами серверов в Одноклассниках. Алексей Чудов.
Управление тысячами серверов в Одноклассниках. Алексей Чудов.
 
Класс!ная Cassandra
Класс!ная CassandraКласс!ная Cassandra
Класс!ная Cassandra
 
Незаурядная Java как инструмент разработки высоконагруженного сервера
Незаурядная Java как инструмент разработки высоконагруженного сервераНезаурядная Java как инструмент разработки высоконагруженного сервера
Незаурядная Java как инструмент разработки высоконагруженного сервера
 
Cистема внутренней статистики Odnoklassniki.ru
Cистема внутренней статистики Odnoklassniki.ruCистема внутренней статистики Odnoklassniki.ru
Cистема внутренней статистики Odnoklassniki.ru
 
Как, используя Lucene, построить высоконагруженную систему поиска разнородных...
Как, используя Lucene, построить высоконагруженную систему поиска разнородных...Как, используя Lucene, построить высоконагруженную систему поиска разнородных...
Как, используя Lucene, построить высоконагруженную систему поиска разнородных...
 

Recently uploaded

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Recently uploaded (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013

  • 1. Being Closer to Cassandra Oleg Anastasyev lead platform developer Odnoklassniki.ru
  • 2. Top 10 of World’s social networks 40M DAU, 80M MAU, 7M peak ~ 300 000 www req/sec, 20 ms render latency >240 Gbit out > 5 800 iron servers in 5 DCs 99.9% java #CASSANDRAEU * Odnoklassniki means “classmates” in english
  • 3. Cassandra @ * Since 2010 - branched 0.6 - aiming at: full operation on DC failure, scalability, ease of operations * Now - 23 clusters - 418 nodes in total - 240 TB of stored data - survived several DC failures #CASSANDRAEU
  • 4. Case #1. The fast #CASSANDRAEU
  • 6. Like! widget * Its everywhere - Have it on every page, dozen - On feeds (AKA timeline) - 3rd party websites elsewhere on internet * Its on everything - Pictures and Albums - Videos - Posts and comments - 3rd party shared URLs #CASSANDRAEU Like! 103 927
  • 7. Like! widget * High load - 1 000 000 reads/sec, 3 000 writes/sec Like! 103 927 Hard load profile * - Read most - Long tail (40% of reads are random) - Sensitive to latency variations - 3TB total dataset (9TB with RF) and growing - ~ 60 billion likes for ~6bi entities #CASSANDRAEU
  • 8. Classic solution SQL table RefId:long RefType:byte UserId:long Created 9999999999 PICTURE(2) 11111111111 11:00 to render You and 4256 SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,? = N >=1 (98% are NONE) SELECT COUNT (*) WHERE RefId,RefType=?,? = M>N (80% are 0) SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId) #CASSANDRAEU = N*140
  • 9. Cassandra solution LikeByRef ( refType byte, refId bigint, userId bigint, LikeCount ( refType byte, refId bigint, likers counter, PRIMARY KEY ( (RefType,RefId), UserId) so, to render PRIMARY KEY ( (RefType,RefId)) You and 4256 SELECT FROM LikeCount WHERE RefId,RefType=?,? (80% are 0) SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,? (98% are NONE) #CASSANDRAEU = N*20%
  • 10. >11 M iops * Quick workaround ? LikeByRef ( refType byte, refId bigint, userId bigint, PRIMARY KEY ( (RefType,RefId, UserId) ) SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId) - Forces Order Pres Partitioner (random not scales) - Key range scans - More network overhead - Partitions count >10x, Dataset size > x2 #CASSANDRAEU
  • 11. By column bloom filter * What is does - Includes pairs of (PartKey, ColumnKey) in SSTable *-Filter.db * The good - Eliminated 98 % of reads - Less false positives * The bad - They become too large GC Promotion Failures .. but fixable (CASSANDRA-2466) #CASSANDRAEU
  • 12. Are we there yet ? 1. COUNT() application server > 400 00 2. EXISTS cassandra - min 2 roundtrips per render (COUNT+RR) - THRIFT is slow, esp having lot of connections - EXISTS() is 200 Gbit/sec (140*8*1Mps*20%) #CASSANDRAEU
  • 13. Co-locate! get() : LikeSummary odnoklassniki-like Remote Business Intf Counters Cache cassandra Social Graph Cache - one-nio remoting (faster than java nio) - topology aware clients #CASSANDRAEU
  • 14. co-location wins * Fast TOP N friend likers query 1. Take friends from graph cache 2. Check it with memory bloom filter 3. Read some until N friends found * Custom caches - Tuned for application * Custom data merge logic - ... so you can detect and resolve conflicts #CASSANDRAEU
  • 15. Listen for mutations // Implement it interface StoreApplyListener { boolean preapply(String key, ColumnFamily data); } // and register with CFS store=Table.open(..) .getColumnFamilyStore(..); store.setListener(myListener); * Register it between commit logs replay and gossip * RowMutation.apply() extend original mutation + Replica, hints, ReadRepairs #CASSANDRAEU
  • 16. Like! optimized counters * Counters cache - Off heap (sun.misc.Unsafe) - Compact (30M in 1G RAM) - Read cached local node only * Replicated cache state - #CASSANDRAEU cold replica cache problem making (NOP) mutations less reads long tail aware LikeCount ( refType byte, refId bigint, ip inet, counter int PRIMARY KEY ( (RefType,RefId), ip)
  • 17. Read latency variations * CS read behavior 1. Choose 1 node for data and N for digest 2. Wait for data and digest 3. Compare and return (or RR) * Nodes suddenly slowdown - SEDA hiccup, commit log rotation, sudden IO saturation, Network hiccup or partition, page cache miss * The bad - You have spikes. - You have to wait (and timeout) #CASSANDRAEU
  • 18. Read Latency leveling * “Parallel” read handler 1. Ask all replicas for data in parallel 2. Wait for CL responses and return * The good - Minimal latency response - Constant load when DC fails * The (not so) bad - “Additional” work and traffic #CASSANDRAEU
  • 19. More tiny tricks * On SSD io - Deadline IO elevator - 64k -> 4k read request size * HintLog - Commit log for hints - Wait for all hints on startup * Selective compaction - Compacts most read CFs more often #CASSANDRAEU
  • 20. Case #2. The fat #CASSANDRAEU
  • 21. * Messages in chats - Last page is accessed on open - long tail (80%) for rest - 150 billion, 100 TB in storage - Read most (120k reads/sec, 8k writes/sec) #CASSANDRAEU
  • 22. Messages have structure Message ( chatId, msgId, MessageCF ( chatId, msgId, created, type,userIndex,deletedBy,... text ) data blob, PRIMARY KEY ( chatId, msgId ) - All chat’s messages in single partition - Single blob for message data to reduce overhead - The bad Conflicting modifications can happen (users, anti-spam, etc..) #CASSANDRAEU
  • 23. LW conflict resolution get get (version:ts1, data:d1) (version:ts1, data:d1) write( ts1, data2 ) delete(version:ts1) insert(version: ts2=now(), data2) Messages ( chatId, msgId, version timestamp, data blob PRIMARY KEY ( chatId, msgId, version ) write( ts1, data3 ) delete(version:ts1) insert(version: ts3=now(), data3) (ts2, data2) (ts3, data3) #CASSANDRAEU - merged on read
  • 24. Specialized cache * Again. Because we can - Off-heap (Unsafe) - Caches only freshest chat page - Saves its state to local (AKA system) CF keys AND values seq read, much faster startup - In memory compression 2x more memory almost free #CASSANDRAEU
  • 25. Disk mgmt * 4U HDDx24, up to 4TB/node - Size tiered compaction = 4 TB sstable file - RAID10 ? LCS ? * Split CF to 256 pieces * The good - Smaller, more frequent memtable flushes - Same compaction work in smaller sets - Can distribute across disks #CASSANDRAEU
  • 26. Disk Allocation Policies * Default is - “Take disk with most free space” * Some disks have - Too much read iops * Generational policy - Each disk has same # of same gen files work better for HDD #CASSANDRAEU
  • 27. Case #3. The ugly feed my Frankenstein #CASSANDRAEU
  • 28. * Chats overview - small dataset (230GB) - has hot set, short tail (5%) - list reorders often - 130k read/s, 21k write/s #CASSANDRAEU
  • 29. Conflicting updates * List<Overview> is single blob .. or you’ll have a lot of tombstones * Lot of conflicts updates of single column * Need conflict detection * Has merge algoritm #CASSANDRAEU
  • 30. Vector clocks * Voldemort - byte[] key -> byte[] value + VC - Coordination logic on clients - Pluggable storage engines * Plugged - CS 0.6 SSTables persistance - Fronted by specialized cache we love caches #CASSANDRAEU
  • 31. Performance * 3 node cluster, RF = 3 - Intel Xeon CPU E5506 2.13GHz RAM: 48Gb, 1x HDD, 1x SSD * 8 byte key -> 1 KB byte value * Results - 75 k /sec reads, 15 k/ sec writes #CASSANDRAEU
  • 32. Why cassandra ? * Reusable distributed DB components fast persistance, gossip, Reliable Async Messaging, Fail detectors, Topology, Seq scans, ... * Has structure beyond byte[] key -> byte[] value * Delivered promises * Implemented in Java #CASSANDRAEU
  • 33. THANK YOU Oleg Anastasyev oa@odnoklassniki.ru odnoklassniki.ru/oa @m0nstermind github.com/odnoklassniki shared-memory-cache java Off-Heap cache using shared memory #CASSANDRAEU one-nio rmi faster than java nio with fast and compact automagic java serialization CASSANDRASUMMITEU