Distributed Systems @ OK.RU
Oleg Anastasyev
@m0nstermind
oa@ok.ru
1. Absolutely reliable network
2. with negligible Latency
3. and practically unlimited Bandwidth
4. It is homogeneous
5. Nobody can break into our LAN
6. Topology changes are unnoticeable
7. All managed by a single genius admin
8. So data transport cost is zero now
2
OK.ru has come to:
1. Absolutely reliable network
2. with negligible Latency
3. and practically unlimited Bandwidth
4. It is homogeneous (same HW and hop count to every server)
5. Nobody can break into our LAN
6. Topology changes are unnoticeable
7. All managed by a single genius admin
8. So data transport cost is zero now
3
Fallacies of distributed computing
https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
[Peter Deutsch, 1994; James Gosling 1997]
4
OK.RU has come to:
• 4 datacenters
• 150 distinct microservices
• 8000 physical ("iron") servers
5
• hardware engineers
• network engineers
• operations
• developers
6
My friends page
1. Retrieve friends ids
2. Filter by friendship type
3. Apply black list
4. Resolve ids to profiles
5. Sort profiles
6. Retrieve stickers
7. Calculate summaries
7
The Simple Way™
SELECT * FROM friendlist f, users u
WHERE userId=? AND f.kind=? AND u.name LIKE ?
AND NOT EXISTS( SELECT * FROM blacklist …)
…
8
Simple ways don't work
• Friendships
  • 12 billion edges, 300 GB
  • 500 000 requests per sec
• User profiles
  • > 350 million
  • 3 500 000 requests/sec, 50 Gbit/sec
9
How stuff works
web frontend API frontend
app server
one-graph user-cache black-list
microservices
10
Micro-service dissected
Remote interface
Business logic, caches
[ Local storage ]
1 JVM
11
Micro-service dissected
Remote interface
https://github.com/odnoklassniki/one-nio
interface GraphService extends RemoteService {
@RemoteMethod
long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
}
interface UserCache {

@RemoteMethod
User getUserById(long id);
}
12
App Server code
https://github.com/odnoklassniki/one-nio
long []friendsIds = graphService.getFriendsByFilter(userId, mask);
List<User> users = new ArrayList<>(friendsIds.length);
for (long id : friendsIds) {
if(blackList.isAllowed(userId,id)) {
users.add(userCache.getUserById(id));
}
}
…
return users;
• Partition by this parameter value
• Using partitioning strategy
• long id -> int partitionId(id) -> node1,node2,…
• Strategies can be different
• Cassandra ring, Voldemort partitions
• or …
13
interface GraphService extends RemoteService {
@RemoteMethod
long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
}
14
Weighted quadrant
p = id % 16
p = 0
p = 15
p = 1
N01 N02 N03 . . . N19 N20
W=1
W=100
N11
node = wrr(p)
SET
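The routing on the last two slides (id to partition, then a weighted choice among that partition's replicas) could look roughly like the sketch below. It is an illustration under stated assumptions, not one-nio's implementation: the class `WeightedPartitionRouter`, its method names, and the choice of smooth weighted round-robin for `wrr(p)` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: id -> partition -> weighted round-robin over that partition's replicas.
final class WeightedPartitionRouter {
    private static final int PARTITIONS = 16;   // p = id % 16, as on the slide

    private static final class Replica {
        final String name;
        final int weight;
        int current;                             // running score used by smooth WRR
        Replica(String name, int weight) { this.name = name; this.weight = weight; }
    }

    private final List<List<Replica>> replicasByPartition = new ArrayList<>();

    WeightedPartitionRouter() {
        for (int p = 0; p < PARTITIONS; p++) replicasByPartition.add(new ArrayList<>());
    }

    void addReplica(int partition, String name, int weight) {
        replicasByPartition.get(partition).add(new Replica(name, weight));
    }

    // floorMod keeps the partition non-negative even for negative ids.
    static int partitionOf(long id) {
        return (int) Math.floorMod(id, (long) PARTITIONS);
    }

    // Smooth weighted round-robin: over one full cycle each replica is picked
    // exactly as many times as its weight.
    synchronized String nodeFor(long id) {
        List<Replica> replicas = replicasByPartition.get(partitionOf(id));
        if (replicas.isEmpty()) throw new IllegalStateException("no replicas for partition");
        int total = 0;
        Replica best = null;
        for (Replica r : replicas) {
            r.current += r.weight;
            total += r.weight;
            if (best == null || r.current > best.current) best = r;
        }
        best.current -= total;
        return best.name;
    }
}
```

With weights W=1 and W=100 on the same partition, the low-weight node receives about 1% of the calls, which is one way to warm up or drain a server gradually.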
15
A coding issue
https://github.com/odnoklassniki/one-nio
long []friendsIds = graphService.getFriendsByFilter(userId, mask);
List<User> users = new ArrayList<>(friendsIds.length);
for (long id : friendsIds) {
if(blackList.isAllowed(userId,id)) {
users.add(userCache.getUserById(id));
}
}
…
return users;
16
A roundtrip price*
• 0.1-0.3 ms
• 0.7-1.0 ms (remote datacenter)
* this price is tightly coupled with the specific infrastructure and frameworks
latency = 1.0 ms * 2 reqs * 200 friends = 400 ms
10k friends latency = 20 seconds
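The arithmetic on this slide can be reproduced in a few lines. The 1.0 ms round trip and the two calls per friend come from the slide; the class `RoundTripEstimate` and the batched variant (a fixed number of round trips per page) are assumptions added for comparison.

```java
// Back-of-the-envelope estimate only; it ignores payload transfer and server-side time.
final class RoundTripEstimate {
    // Sequential calls: every friend costs callsPerFriend network round trips.
    static double sequentialMillis(double roundTripMs, int callsPerFriend, int friends) {
        return roundTripMs * callsPerFriend * friends;
    }

    // Batched calls: a fixed number of round trips regardless of the list size (assumption).
    static double batchedMillis(double roundTripMs, int batchCalls) {
        return roundTripMs * batchCalls;
    }
}
```

`sequentialMillis(1.0, 2, 200)` gives the slide's 400 ms and `sequentialMillis(1.0, 2, 10_000)` gives 20 000 ms. Under the assumption of three batch calls per page, `batchedMillis(1.0, 3)` stays at 3 ms of round-trip cost.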
17
Batch requests to the rescue
public interface UserCache {

@RemoteMethod( split = true )
Collection<User> getUsersByIds(long[] keys);
}
long []friendsIds = graphService.getFriendsByFilter(userId, mask);


friendsIds = blackList.filterAllowed(userId, friendsIds );
Collection<User> users = userCache.getUsersByIds(friendsIds);
…
return users;
18
split & merge
split ( ids by p )
-> ids0, ids1
p = 0
p = 1
N01 N02 N03 . . .
N11
ids0
ids1
users = merge (users0, users1)
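A rough sketch of what split and merge might do for a batched call is shown here. It is not the one-nio code: `SplitMerge`, `split`, `fetchAll` and the `fetchFromPartition` callback are hypothetical names, and the sub-batches are sent one after another for brevity, whereas a real client would presumably send them in parallel.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical sketch of split & merge for a batched lookup.
final class SplitMerge {
    // split: group ids by partition (p = id mod partitions), keeping the input order per group.
    static Map<Integer, List<Long>> split(long[] ids, int partitions) {
        Map<Integer, List<Long>> byPartition = new TreeMap<>();
        for (long id : ids) {
            int p = (int) Math.floorMod(id, (long) partitions);
            byPartition.computeIfAbsent(p, k -> new ArrayList<>()).add(id);
        }
        return byPartition;
    }

    // merge: call each partition once with its sub-batch and concatenate the partial results.
    static <U> List<U> fetchAll(long[] ids, int partitions,
                                BiFunction<Integer, List<Long>, Collection<U>> fetchFromPartition) {
        List<U> merged = new ArrayList<>(ids.length);
        for (Map.Entry<Integer, List<Long>> e : split(ids, partitions).entrySet()) {
            merged.addAll(fetchFromPartition.apply(e.getKey(), e.getValue()));
        }
        return merged;
    }
}
```

The number of remote calls is bounded by the number of partitions touched, not by the number of ids.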
19
1. Client crash
2. Server crash
3. Request omission
4. Response omission
5. Server timeout
6. Invalid value response
7. Arbitrary failure
What could possibly fail ?
Failures
Distributed systems at OK.RU
• We can not prevent failures - only mask them
• If a Failure can occur it will occur
• Redundancy is a must to mask failures
• Information ( error correction codes )
• Hardware (replicas, substitute hardware)
• Time (transactions, retries)
21
What to do with failures ?
22
What happened to transaction ?
Don’t give up!
Must retry !
Must give up! 

Don't retry !
? ?
Add Friend
• The client does not really know
• What can the client do?
  • Make no guarantees.
  • Never retry: at most once.
  • Always retry: at least once.
23
Did the friendship succeed?
1. Transaction in ACID database
• single master, success is atomic (either yes or no)
• atomic rollback is possible
2. Cache cluster refresh
• many replicas, no master
• no rollback, partial failures are possible
24
Making new friendship
• Operation can be reapplied multiple times with same result
• e.g.: read, Set.add(), Math.max(x,y)
• Atomic change with order and dup control

25
Idempotence
“Always retry” policy can be applied

only on

Idempotent Operations
https://en.wikipedia.org/wiki/Idempotence
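To make the distinction concrete, the small illustration below applies the same request twice, as an "always retry" client might. `Set.add` and `Math.max` are the slide's own examples; the class `IdempotenceDemo` and the counter used as a counter-example are made up for this sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: two idempotent updates and one that is not.
final class IdempotenceDemo {
    private final Set<Long> friends = new HashSet<>();
    private long maxSeenVersion;
    private int friendCounter;

    void addFriend(long friendId)     { friends.add(friendId); }                                  // idempotent
    void observeVersion(long version) { maxSeenVersion = Math.max(maxSeenVersion, version); }    // idempotent
    void incrementCounter()           { friendCounter++; }                                       // NOT idempotent

    int friendCount()     { return friends.size(); }
    long maxSeenVersion() { return maxSeenVersion; }
    int counter()         { return friendCounter; }
}
```

Calling `addFriend(42)` or `observeVersion(7)` twice leaves the same state as calling it once, while a retried `incrementCounter()` silently double-counts.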
26
Idempotence in ACID database
Make friends
wait; timeout
Make friends (retry)
Friendship, peace and bubble gum !
Already friends ?
No, let’s make it !
Already friends ?
Yes, NOP !
27
Sequencing
MakeFriends (OpId)
Made friends!
Is Dup (OpId) ?
No, making changes
OpId := Generate()
Generate() examples:
• OpId+=1
• OpId=currentTimeMillis()
• OpId=TimeUUID
http://johannburkard.de/software/uuid/
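The sequencing exchange above might be sketched as follows. Everything here is hypothetical (`FriendshipService`, `makeFriends`, the in-memory sets). In a real store the duplicate check and the change would need to happen in one atomic transaction. Note also that `java.util.UUID.randomUUID()` produces random (version 4) ids, not the time-based ids the slide mentions, so it gives duplicate control but no ordering.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical server-side duplicate control keyed by a client-generated operation id.
final class FriendshipService {
    private final Set<UUID> appliedOps = ConcurrentHashMap.newKeySet();
    private final Set<String> friendships = ConcurrentHashMap.newKeySet();

    // Returns true if this call made the change, false if opId was seen before (a retry).
    boolean makeFriends(UUID opId, long a, long b) {
        if (!appliedOps.add(opId)) {
            return false;                 // "Is Dup (OpId)?" answered yes: do nothing
        }
        friendships.add(key(a, b));       // "No, making changes"
        return true;
    }

    boolean areFriends(long a, long b) {
        return friendships.contains(key(a, b));
    }

    // Friendship is symmetric, so the key does not depend on argument order.
    private static String key(long a, long b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }
}
```

The client generates the id once and reuses it on every retry, which turns a non-idempotent request into one that is safe to repeat.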
1. Transaction in ACID database
• single master, success is atomic (either yes or no)
• atomic rollback is possible
2. Cache cluster refresh
• many replicas, no master
• no rollback, partial failures are possible
28
Making new friendship
29
Cache cluster refresh
add(Friend)
p = 0 N01 N02 N03 . . .
Retries are meaningless,
but the replicas' state will diverge otherwise
• Background data sync process
• Reads updated records from ACID store



SELECT * FROM users WHERE modified > ?
• Applies them into its memory
• Loads updates on node startup
• Retry can be omitted then

30
Syncing cache from DB
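A background sync of this kind could be sketched as below. The database is replaced by a hypothetical `UserStore` interface standing in for the `SELECT ... WHERE modified > ?` query, and `SyncingUserCache`, `UserRow` and `syncOnce` are invented names. A production version would also have to deal with rows that share a timestamp with the last one seen, which this sketch does not.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a cache replica that pulls changes from the authoritative store.
final class SyncingUserCache {
    static final class UserRow {
        final long id;
        final String name;
        final long modified;
        UserRow(long id, String name, long modified) {
            this.id = id; this.name = name; this.modified = modified;
        }
    }

    // Stands in for: SELECT * FROM users WHERE modified > ?
    interface UserStore {
        List<UserRow> modifiedAfter(long timestamp);
    }

    private final UserStore store;
    private final Map<Long, UserRow> byId = new ConcurrentHashMap<>();
    private long lastSeenModified = Long.MIN_VALUE;   // first sync loads everything (node startup)

    SyncingUserCache(UserStore store) { this.store = store; }

    // Run once on startup and then periodically, e.g. from a ScheduledExecutorService.
    synchronized int syncOnce() {
        List<UserRow> changed = store.modifiedAfter(lastSeenModified);
        for (UserRow row : changed) {
            byId.put(row.id, row);                                   // overwrite is idempotent
            lastSeenModified = Math.max(lastSeenModified, row.modified);
        }
        return changed.size();
    }

    String nameOf(long id) {
        UserRow row = byId.get(id);
        return row == null ? null : row.name;
    }
}
```

Because every replica converges on the store's state by itself, a lost cache update is repaired by the next sync pass and the writer does not need to retry it.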
31
Death by timeout
GC
Make Friends
wait; timeout
thread pool 

exhausted
1. Clients stop sending requests to a server
   after X consecutive failures within the last second
2. Clients monitor server availability
   in the background, once a minute
3. and turn it back on when it recovers
32
Server cut-off
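A client-side cut-off along these lines might be sketched as follows. `ServerCutOff` and its methods are hypothetical, and the sketch counts consecutive failures only; the one-second window from the slide is left out to keep it short.

```java
// Hypothetical client-side cut-off: stop calling a server after too many consecutive failures.
final class ServerCutOff {
    private final int maxConsecutiveFailures;   // the slide's "X"
    private int consecutiveFailures;
    private boolean cutOff;

    ServerCutOff(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Checked before each request; a cut-off server is skipped in favour of another replica.
    synchronized boolean isAvailable() { return !cutOff; }

    synchronized void onSuccess() { consecutiveFailures = 0; }

    synchronized void onFailure() {
        if (++consecutiveFailures >= maxConsecutiveFailures) cutOff = true;
    }

    // Called by a background monitor (the slide suggests roughly once a minute).
    synchronized void onProbe(boolean serverResponded) {
        if (serverResponded) {
            cutOff = false;
            consecutiveFailures = 0;
        }
    }
}
```

Failing fast on a cut-off server keeps client threads from piling up on timeouts, which is the thread pool exhaustion shown on the previous slide.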
33
Death by slowing down
Avg = 1.5ms
Max = 1.5s
24 cpu cores
Cap = 24,000 ops
Choose 2.4ms timeout ?
Cut it off from client if latency avg > 2.4ms ?
Avg = 24ms
Max = 1.5s
24 cpu cores
Cap = 1,000 ops
10,000 ops
34
Speculative retry
Idempotent Op
wait; timeout
Retry
Result Response
• Makes requests to replicas before the timeout expires
• Better 99th-percentile and even average latencies
• More stable system
• Not always applicable:
  • Idempotent ops only; consider the additional load and traffic
• Can be balanced: always, >avg, >99p
35
Speculative retry
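One way to express a speculative retry with the standard `java.util.concurrent` classes is sketched below. The class `SpeculativeRetry`, the `call` method and its parameters are hypothetical. As a simplification, the first outcome wins even if it is a failure; a fuller version would wait for the other attempt in that case.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch; only safe for idempotent operations.
final class SpeculativeRetry {
    // Starts primary immediately. If no result has arrived after delayMs, also starts backup.
    // The returned future completes with whichever attempt finishes first.
    static <T> CompletableFuture<T> call(Supplier<CompletableFuture<T>> primary,
                                         Supplier<CompletableFuture<T>> backup,
                                         long delayMs,
                                         ScheduledExecutorService timer) {
        CompletableFuture<T> result = new CompletableFuture<>();
        primary.get().whenComplete((v, e) -> complete(result, v, e));
        ScheduledFuture<?> speculative = timer.schedule(() -> {
            if (!result.isDone()) {
                backup.get().whenComplete((v, e) -> complete(result, v, e));
            }
        }, delayMs, TimeUnit.MILLISECONDS);
        // Once there is a result, the pending speculative attempt is no longer needed.
        result.whenComplete((v, e) -> speculative.cancel(false));
        return result;
    }

    // complete()/completeExceptionally() are no-ops on an already completed future,
    // so the slower attempt is simply ignored.
    private static <T> void complete(CompletableFuture<T> result, T value, Throwable error) {
        if (error != null) result.completeExceptionally(error);
        else result.complete(value);
    }
}
```

The `delayMs` parameter corresponds to the balancing knob on the slide: 0 means "always", while the observed average or 99th percentile limits the extra load to the slow tail.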
More failures !
Distributed systems @ OK.RU
• Excessive load
• Excessive paranoia
• Bugs
• Human error
• Massive outages
37
All replicas failure
38
Use of non-authoritative datasources,
degrade consistency
Use of incomplete data in UI,
partial feature degradation

Single feature full degradation
Degrade (gracefully) !
39
The code
interface UserCache {

@RemoteMethod
Distributed<Collection<User>> getUsersByIds(long[] keys);
}
interface Distributed<D>
{
boolean isInconsistency();
D getData();
}
class UserCacheStub implements UserCache {


@Override
public Distributed<Collection<User>> getUsersByIds(long[] keys) {
return Distributed.inconsistent();
}
}
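On the calling side, the inconsistency flag lets the UI degrade instead of failing. The sketch below guesses at a possible shape for `Distributed<D>`, since the deck shows only the two accessors. The factory methods `consistent` and `inconsistent(data)`, the class `FriendsPage` and the rendering text are assumptions, and plain strings stand in for `User` objects.

```java
import java.util.Collection;

// Hypothetical shape for the slide's Distributed<D>; the real class is not shown in the deck.
final class Distributed<D> {
    private final D data;
    private final boolean inconsistency;

    private Distributed(D data, boolean inconsistency) {
        this.data = data;
        this.inconsistency = inconsistency;
    }

    static <D> Distributed<D> consistent(D data)   { return new Distributed<>(data, false); }
    static <D> Distributed<D> inconsistent(D data) { return new Distributed<>(data, true); }

    boolean isInconsistency() { return inconsistency; }
    D getData()               { return data; }
}

final class FriendsPage {
    // Renders whatever data is available and tells the user when it may be incomplete.
    static String render(Distributed<Collection<String>> friends) {
        String list = String.join(", ", friends.getData());
        return friends.isInconsistency() ? list + " (list may be incomplete)" : list;
    }
}
```

The design choice is that degradation becomes part of the return type, so every caller has to decide what to show when the data is not authoritative.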
Resilience testing
Distributed systems at OK.RU
41
The product you make
Operations in production env
What to test for failure ?
“Standard” products - with special care !
• What it does:
• Detects network connections between servers
• Disables them (iptables drop)
• Runs auto tests
• What we check
• No crashes, nice UI messages are rendered
• Server does start and can serve requests
42
The product we make : “Guerrilla”
Production diagnostics
Distributed systems at OK.RU
• To know an accident exists. Fast.
• To track down to the source of accident. Fast.
• To prevent accidents before they happen.
44
Why
• Zabbix
• Cacti
• Operational metrics
• Names of operations, e.g. “Graph.getFriendsByFilter”
• Call count, their success or failure
• Latency of calls
45
Is there (or will there be) an accident?
• Current metrics and trends
• Aggregated call and failure counts
• Aggregated latencies
• Average, Max
• Percentiles 50,75,98,99,99.9
46
What charts show to us
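The aggregated latencies listed above can be computed from raw samples roughly as follows. This is a generic nearest-rank percentile, not the statistics code used at OK.RU; `LatencyStats` and its methods are names invented for the sketch.

```java
import java.util.Arrays;

// Illustration: average versus nearest-rank percentiles over a window of latency samples.
final class LatencyStats {
    // Nearest-rank percentile: the smallest sample with at least pct% of samples <= it.
    static double percentile(double[] samplesMs, double pct) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double average(double[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElseThrow(IllegalArgumentException::new);
    }
}
```

A handful of very slow calls barely moves the median but shows up clearly at the 99th and 99.9th percentiles, which is why those are charted next to the average and the maximum.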
47
More charts
48
Anomaly detection
• The possibilities for failure in distributed systems are endless
• Don't “prevent”, but mask failures through redundancy
• Degrade gracefully on unmask-able failure
• Test failures
• Production diagnostics are key to failure detection and prevention
49
Short summary
50 Distributed Systems at OK.RU
Try these links for more
• slideshare.net/m0nstermind
• https://v.ok.ru/publishing.html
• James Aspnes, Notes on Theory of Distributed Systems, CS 465/565, Spring 2014: http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf

Distributed systems at ok.ru #rigadevday

  • 1. Distributed Systems @ OK.RU Oleg Anastasyev @m0nstermind oa@ok.ru
  • 2. 1. Absolutely reliable network 2. with negligible Latency 3. and practically unlimited Bandwidth 4. It is homogenous 5. Nobody can break into our LAN 6. Topology changes are unnoticeable 7. All managed by single genius admin 8. So data transport cost is zero now 2 OK.ru has come to:
  • 3. 1. Absolutely reliable network 2. with negligible Latency 3. and practically unlimited Bandwidth 4. It is homogenous (same HW and hop cnt to every server) 5. Nobody can break into our LAN 6. Topology changes are unnoticeable 7. All managed by single genius admin 8. So data transport cost is zero now 3 Fallacies of distributed computing https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing [Peter Deutsch, 1994; James Gosling 1997]
  • 6. 6 My friends page 1. Retrieve friends ids 2. Filter by friendship type 3. Apply black list 4. Resolve ids to profiles 5. Sort profiles 6. Retrieve stickers 7. Calculate summaries
  • 7. 7 The Simple WayTM SELECT * FROM friendlist, users 
 WHERE userId=? AND f.kind=? AND u.name LIKE ? AND NOT EXISTS( SELECT * FROM blacklist …) …
  • 8. • Friendships • 12 billions of edges, 300GB • 500 000 requests per sec 8 Simple ways don't work • User profiles • > 350 millions, • 3 500 000 requests/sec, 50 Gbit/sec
  • 9. 9 How stuff works web frontend API frontend app server one-graph user-cache black-list microservices
  • 10. 10 Micro-service dissected Remote interface Business logic, caches [ Local storage ] 1 JVM
  • 11. 11 Micro-service dissected Remote interface https://github.com/odnoklassniki/one-nio interface GraphService extends RemoteService { @RemoteMethod long[] getFriendsByFilter(@Partition long vertexId, long relationMask); } interface UserCache {
 @RemoteMethod User getUserById(long id); }
  • 12. 12 App Server code https://github.com/odnoklassniki/one-nio long []friendsIds = graphService.getFriendsByFilter(userId, mask); List<User> users = new ArrayList<Long>(friendsIds.length); for (long id : friendsIds) { if(blackList.isAllowed(userId,id)) { users.add(userCache.getUserById(id)); } } … return users;
  • 13. • Partition by this parameter value • Using partitioning strategy • long id -> int partitionId(id) -> node1,node2,… • Strategies can be different • Cassandra ring, Voldemort partitions • or … 13 interface GraphService extends RemoteService { @RemoteMethod long[] getFriendsByFilter(@Partition long vertexId, long relationMask); }
  • 14. 14 Weighted quadrant [diagram: p = id % 16 selects a partition, p = 0 … p = 15; each partition is served by a SET of nodes (N01, N02, N03, …, N11) carrying weights such as W=1 and W=100; node = wrr(p)]
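The mapping id -> partition -> node from the two slides above can be sketched roughly as follows. This is a minimal illustration, not the one-nio implementation: it assumes that `wrr` stands for weighted round-robin, and the class `WeightedPartitionRouter` and its methods are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the one-nio implementation.
final class WeightedPartitionRouter {
    static final int PARTITIONS = 16;

    private static final class Node {
        final String name;
        final int weight;
        Node(String name, int weight) { this.name = name; this.weight = weight; }
    }

    // One replica set per partition; fill it before routing (addNode is not thread-safe).
    private final List<List<Node>> replicaSets = new ArrayList<>();
    private final AtomicLong counter = new AtomicLong();

    WeightedPartitionRouter() {
        for (int p = 0; p < PARTITIONS; p++) {
            replicaSets.add(new ArrayList<>());
        }
    }

    void addNode(int partition, String name, int weight) {
        replicaSets.get(partition).add(new Node(name, weight));
    }

    // p = id % 16, kept non-negative for negative ids.
    static int partitionId(long id) {
        return (int) Math.floorMod(id, (long) PARTITIONS);
    }

    // Assumed meaning of wrr(p): each node receives a share of calls proportional to its weight.
    String route(long id) {
        List<Node> set = replicaSets.get(partitionId(id));
        long total = 0;
        for (Node n : set) total += n.weight;
        if (total == 0) throw new IllegalStateException("no nodes for partition " + partitionId(id));
        long slot = Math.floorMod(counter.getAndIncrement(), total);
        for (Node n : set) {
            if (slot < n.weight) return n.name;
            slot -= n.weight;
        }
        throw new AssertionError("unreachable: slot is always below the total weight");
    }
}
```

With weights 1 and 3 on two nodes of the same partition, one call in four lands on the lighter node. A low weight such as W=1 next to W=100 would let a freshly started node take a trickle of traffic, which is one plausible reading of the diagram.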
  • 15. 15 A coding issue https://github.com/odnoklassniki/one-nio long[] friendsIds = graphService.getFriendsByFilter(userId, mask); List<User> users = new ArrayList<>(friendsIds.length); for (long id : friendsIds) { if (blackList.isAllowed(userId, id)) { users.add(userCache.getUserById(id)); } } … return users;
  • 16. 16 A roundtrip price (*): 0.1-0.3 ms, or 0.7-1.0 ms to a remote datacenter • latency = 1.0 ms * 2 reqs * 200 friends = 400 ms • 10k friends: latency = 20 seconds • (*) this price is tightly coupled with the specific infrastructure and frameworks
  • 17. 17 Batch requests to the rescue public interface UserCache { @RemoteMethod( split = true ) Collection<User> getUsersByIds(long[] keys); } long[] friendsIds = graphService.getFriendsByFilter(userId, mask); friendsIds = blackList.filterAllowed(userId, friendsIds); Collection<User> users = userCache.getUsersByIds(friendsIds); … return users;
  • 18. 18 split & merge [diagram: split(ids by p) -> ids0, ids1; ids0 goes to a node of partition p = 0 and ids1 to a node of partition p = 1 (nodes N01, N02, N03, …, N11); users = merge(users0, users1)]
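A rough sketch of what split & merge does on the client side, under stated assumptions: `SplitMerge`, `split` and `fetchAll` are hypothetical names, the partition function is the `id % partitions` rule from the earlier slide, and the per-partition remote call is passed in as a plain function. The batches are fetched sequentially here for brevity; a real client would presumably send them in parallel.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical sketch of client-side split & merge, not the one-nio code.
final class SplitMerge {

    // split: group ids by partition so that each partition gets one batch.
    static Map<Integer, long[]> split(long[] ids, int partitions) {
        Map<Integer, List<Long>> grouped = new TreeMap<>();
        for (long id : ids) {
            int p = (int) Math.floorMod(id, (long) partitions);
            grouped.computeIfAbsent(p, k -> new ArrayList<>()).add(id);
        }
        Map<Integer, long[]> batches = new TreeMap<>();
        for (Map.Entry<Integer, List<Long>> e : grouped.entrySet()) {
            batches.put(e.getKey(), e.getValue().stream().mapToLong(Long::longValue).toArray());
        }
        return batches;
    }

    // One request per partition instead of one per id, then merge the partial results.
    static <U> List<U> fetchAll(long[] ids, int partitions,
                                BiFunction<Integer, long[], Collection<U>> fetchFromPartition) {
        List<U> merged = new ArrayList<>();
        for (Map.Entry<Integer, long[]> batch : split(ids, partitions).entrySet()) {
            merged.addAll(fetchFromPartition.apply(batch.getKey(), batch.getValue()));
        }
        return merged;
    }
}
```

The number of round trips now depends on the number of partitions touched, not on the number of friends.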
  • 19. 19 What could possibly fail? 1. Client crash 2. Server crash 3. Request omission 4. Response omission 5. Server timeout 6. Invalid value response 7. Arbitrary failure
  • 21. 21 What to do with failures? • We cannot prevent failures, only mask them • If a failure can occur, it will occur • Redundancy is a must to mask failures • Information (error correction codes) • Hardware (replicas, substitute hardware) • Time (transactions, retries)
  • 22. 22 What happened to the transaction? [diagram: an Add Friend request with an unknown outcome (? ?) and two conflicting answers: "Don't give up! Must retry!" versus "Must give up! Don't retry!"]
  • 23. 23 Did the friendship succeed? • The client does not really know • What can the client do? • Don't make any guarantees. • Never retry: At Most Once. • Always retry: At Least Once.
  • 24. 24 Making new friendship 1. Transaction in ACID database • single master, success is atomic (either yes or no) • atomic rollback is possible 2. Cache cluster refresh • many replicas, no master • no rollback, partial failures are possible
  • 25. 25 Idempotence • An operation can be reapplied multiple times with the same result • e.g.: read, Set.add(), Math.max(x,y) • Atomic change with order and dup control • The “Always retry” policy can be applied only to idempotent operations https://en.wikipedia.org/wiki/Idempotence
  • 26. 26 Idempotence in ACID database [sequence: client sends Make friends; server: "Already friends? No, let’s make it!"; client waits, then times out; client sends Make friends (retry); server: "Already friends? Yes, NOP!"; reply: "Friendship, peace and bubble gum!"]
  • 27. 27 Sequencing [sequence: client: OpId := Generate(), then MakeFriends(OpId); server: Is Dup(OpId)? No, making changes; reply: Made friends!] Generate() examples: • OpId+=1 • OpId=currentTimeMillis() • OpId=TimeUUID http://johannburkard.de/software/uuid/
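The sequencing idea can be sketched as a server that remembers which OpIds it has already applied, so a retried request becomes a no-op. This is a minimal in-memory illustration only: `FriendshipService` and `OpIdGenerator` are hypothetical names, the generator uses the simplest `OpId+=1` variant, and a real service would have to store the OpId atomically with the change itself (for example in the same database transaction).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical client-side generator: the "OpId+=1" variant from the slide.
final class OpIdGenerator {
    private final AtomicLong next = new AtomicLong();

    long generate() {
        return next.incrementAndGet();
    }
}

// Hypothetical server side: duplicate control makes MakeFriends safe to retry.
final class FriendshipService {
    private final Set<Long> appliedOps = new HashSet<>();
    private final Set<String> friendships = new HashSet<>();
    private int changesMade;

    // Returns true if the change was applied, false if opId was a duplicate.
    synchronized boolean makeFriends(long opId, long userA, long userB) {
        if (!appliedOps.add(opId)) {
            return false; // Is Dup(OpId)? Yes: do nothing, the first attempt already succeeded.
        }
        friendships.add(Math.min(userA, userB) + ":" + Math.max(userA, userB));
        changesMade++;
        return true;
    }

    synchronized boolean areFriends(long userA, long userB) {
        return friendships.contains(Math.min(userA, userB) + ":" + Math.max(userA, userB));
    }

    synchronized int changesMade() {
        return changesMade;
    }
}
```

A client that timed out can resend the request with the same OpId as often as it likes; only the first delivery changes state.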
  • 28. 28 Making new friendship 1. Transaction in ACID database • single master, success is atomic (either yes or no) • atomic rollback is possible 2. Cache cluster refresh • many replicas, no master • no rollback, partial failures are possible
  • 29. 29 Cache cluster refresh [diagram: add(Friend) sent to the replicas N01, N02, N03, … of partition p = 0] Retries are meaningless, but replica state will diverge otherwise
  • 30. 30 Syncing cache from DB • Background data sync process • Reads updated records from the ACID store: SELECT * FROM users WHERE modified > ? • Applies them to its memory • Loads updates on node startup • Retries can then be omitted
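One way to picture the background sync, as a sketch under assumptions: `UserRecord`, `UserStore` and `SyncedUserCache` are hypothetical names, and `findModifiedAfter` stands in for the `SELECT * FROM users WHERE modified > ?` query. Each cache node keeps the highest `modified` value it has seen and asks the store only for newer records.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical record shape; the real users table is not shown in the slides.
final class UserRecord {
    final long id;
    final String name;
    final long modified;

    UserRecord(long id, String name, long modified) {
        this.id = id;
        this.name = name;
        this.modified = modified;
    }
}

// Stand-in for: SELECT * FROM users WHERE modified > ?
interface UserStore {
    List<UserRecord> findModifiedAfter(long modified);
}

final class SyncedUserCache {
    private final Map<Long, UserRecord> memory = new ConcurrentHashMap<>();
    private final UserStore store;
    private long lastModified = Long.MIN_VALUE;

    SyncedUserCache(UserStore store) {
        this.store = store;
    }

    // Run on node startup and then periodically from a background thread.
    synchronized int syncOnce() {
        List<UserRecord> updates = store.findModifiedAfter(lastModified);
        for (UserRecord record : updates) {
            memory.put(record.id, record);
            lastModified = Math.max(lastModified, record.modified);
        }
        return updates.size();
    }

    UserRecord get(long id) {
        return memory.get(id);
    }
}
```

Because every replica eventually pulls the same records from the authoritative store, a lost cache update heals itself without the writer retrying. A strict `>` comparison can miss rows that share the last seen timestamp, so a production version would need a safer watermark.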
  • 31. 31 Death by timeout [diagram: the server stalls in GC; Make Friends requests wait, then time out; thread pool exhausted]
  • 32. 32 Server cut-off 1. Clients stop sending requests to a server after X consecutive failures within the last second 2. Clients monitor server availability in the background, once a minute 3. And turn it back on
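The cut-off rule above resembles a circuit breaker kept by each client per server. A minimal sketch, assuming hypothetical names (`ServerCutOff`, `onFailure`, `onProbe`) and an injectable clock so the logic can be exercised without real waiting; the background availability check is represented only by the `onProbe` callback.

```java
import java.util.function.LongSupplier;

// Hypothetical per-server state kept by a client; not the OK.RU implementation.
final class ServerCutOff {
    private final int maxFailures;   // "X" from the slide
    private final long windowMillis; // "the last second"
    private final LongSupplier clock;

    private int consecutiveFailures;
    private long firstFailureAt;
    private boolean cutOff;

    ServerCutOff(int maxFailures, long windowMillis, LongSupplier clock) {
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    // Clients check this before sending a request to the server.
    synchronized boolean isAvailable() {
        return !cutOff;
    }

    synchronized void onSuccess() {
        consecutiveFailures = 0;
    }

    synchronized void onFailure() {
        long now = clock.getAsLong();
        if (consecutiveFailures == 0 || now - firstFailureAt > windowMillis) {
            // Start a new failure streak; older failures fall outside the window.
            consecutiveFailures = 0;
            firstFailureAt = now;
        }
        if (++consecutiveFailures >= maxFailures) {
            cutOff = true;
        }
    }

    // Called by the background monitor (once a minute in the slides).
    synchronized void onProbe(boolean serverAlive) {
        if (serverAlive) {
            cutOff = false;
            consecutiveFailures = 0;
        }
    }
}
```

Failing fast on a cut-off server keeps client threads from piling up behind timeouts, which is the failure mode shown on the previous slide.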
  • 33. 33 Death by slowing down [diagram, two states of a server: Avg = 1.5ms, Max = 1.5s, 24 cpu cores, Cap = 24,000 ops; and Avg = 24ms, Max = 1.5s, 24 cpu cores, Cap = 1,000 ops, 10,000 ops] Choose a 2.4ms timeout? Cut it off from the client if latency avg > 2.4ms?
  • 34. 34 Speculative retry [diagram: Idempotent Op; wait; timeout; Retry; Result; Response]
  • 35. 35 Speculative retry • Makes requests to replicas before the timeout • Better 99th-percentile and even average latencies • More stable system • Not always applicable: • Idempotent ops, additional load, traffic (to consider) • Can be balanced: always, >avg, >99p
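A speculative retry could look roughly like this with `CompletableFuture`. It is a sketch under assumptions: `SpeculativeRetry.call` is a hypothetical helper, the two suppliers stand for asynchronous calls to two replicas, and `speculateAfterMillis` is the chosen threshold (for example the average or 99th-percentile latency). The operation must be idempotent, since both replicas may end up executing it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper, not part of one-nio.
final class SpeculativeRetry {

    // Calls the primary replica; if no result arrived after speculateAfterMillis,
    // also calls the backup replica. The first successful response wins.
    static <T> CompletableFuture<T> call(Supplier<CompletableFuture<T>> primary,
                                         Supplier<CompletableFuture<T>> backup,
                                         long speculateAfterMillis,
                                         ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        AtomicInteger failures = new AtomicInteger();
        attempt(primary, result, failures);
        scheduler.schedule(() -> {
            if (!result.isDone()) {
                attempt(backup, result, failures);
            }
        }, speculateAfterMillis, TimeUnit.MILLISECONDS);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> replica,
                                    CompletableFuture<T> result,
                                    AtomicInteger failures) {
        replica.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value); // no effect if the other attempt already won
            } else if (failures.incrementAndGet() == 2) {
                result.completeExceptionally(error); // fail only when both attempts failed
            }
        });
    }
}
```

Setting the threshold to zero corresponds to the "always" policy from the slide; a higher threshold trades a little latency for less extra load on the replicas.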
  • 36. More failures ! Distributed systems @ OK.RU
  • 37. 37 All replicas failure • Excessive load • Excessive paranoia • Bugs • Human error • Massive outages
  • 38. 38 Degrade (gracefully)! • Use of non-authoritative datasources, degrade consistency • Use of incomplete data in UI, partial feature degradation • Single feature full degradation
  • 39. 39 The code interface UserCache { @RemoteMethod Distributed<Collection<User>> getUsersByIds(long[] keys); } interface Distributed<D> { boolean isInconsistency(); D getData(); } class UserCacheStub implements UserCache { @Override public Distributed<Collection<User>> getUsersByIds(long[] keys) { return Distributed.inconsistent(); } }
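To show how a caller might use such a wrapper, here is a self-contained sketch. The slide only shows the `Distributed` interface and a call to `Distributed.inconsistent()`; the factory bodies, the `consistent` factory, the `null` payload of an inconsistent result and the `FriendsBlock` renderer are all assumptions made for illustration, and plain strings stand in for `User`.

```java
import java.util.Collection;

interface Distributed<D> {
    boolean isInconsistency();
    D getData();

    // Assumed factory for a normal, trustworthy result.
    static <D> Distributed<D> consistent(D data) {
        return new Distributed<D>() {
            @Override public boolean isInconsistency() { return false; }
            @Override public D getData() { return data; }
        };
    }

    // Assumed body: the slide shows only the call, not the implementation.
    static <D> Distributed<D> inconsistent() {
        return new Distributed<D>() {
            @Override public boolean isInconsistency() { return true; }
            @Override public D getData() { return null; }
        };
    }
}

// Hypothetical UI code that degrades one feature instead of failing the page.
final class FriendsBlock {
    static String render(Distributed<? extends Collection<String>> friends) {
        if (friends.isInconsistency()) {
            return "Friends are temporarily unavailable";
        }
        return "Friends: " + String.join(", ", friends.getData());
    }
}
```

Making the degraded state part of the return type forces every caller to decide what to show when the data cannot be trusted.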
  • 41. 41 What to test for failure? • The product you make • Operations in production env • “Standard” products, with special care!
  • 42. 42 The product we make: “Guerrilla” • What it does: • Detects network connections between servers • Disables them (iptables drop) • Runs auto tests • What we check: • No crashes, nice UI messages are rendered • Server does start and can serve requests
  • 44. 44 Why • To know that an accident exists. Fast. • To track down the source of the accident. Fast. • To prevent accidents before they happen.
  • 45. 45 Is there (will there be) an accident? • Zabbix • Cacti • Operational metrics • Names of operations, e.g. “Graph.getFriendsByFilter” • Call count, their success or failure • Latency of calls
  • 46. 46 What charts show us • Current metrics and trends • Aggregated call and failure counts • Aggregated latencies • Average, Max • Percentiles 50, 75, 98, 99, 99.9
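The per-operation metrics listed on these slides (call count, failures, average, max, percentiles) can be aggregated along these lines. This is a minimal sketch: `OperationStats` is a hypothetical class, it keeps every sample in memory, and it uses the nearest-rank definition of a percentile. A production system would more likely use histograms or sampling to bound memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory aggregator for one operation, e.g. "Graph.getFriendsByFilter".
final class OperationStats {
    private final String operation;
    private final List<Long> latenciesMicros = new ArrayList<>();
    private long failures;

    OperationStats(String operation) {
        this.operation = operation;
    }

    String operation() {
        return operation;
    }

    synchronized void record(long latencyMicros, boolean success) {
        latenciesMicros.add(latencyMicros);
        if (!success) failures++;
    }

    synchronized long calls() {
        return latenciesMicros.size();
    }

    synchronized long failures() {
        return failures;
    }

    synchronized double average() {
        long sum = 0;
        for (long latency : latenciesMicros) sum += latency;
        return latenciesMicros.isEmpty() ? 0.0 : (double) sum / latenciesMicros.size();
    }

    synchronized long max() {
        return latenciesMicros.isEmpty() ? 0L : Collections.max(latenciesMicros);
    }

    // Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
    synchronized long percentile(double p) {
        if (latenciesMicros.isEmpty()) throw new IllegalStateException("no samples recorded");
        List<Long> sorted = new ArrayList<>(latenciesMicros);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

The average alone hides the tail: in the "death by slowing down" example both states share the same maximum, so percentiles are what reveal how many calls are actually slow.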
  • 49. 49 Short summary • The possibilities for failure in distributed systems are endless • Don't “prevent”, but mask failures through redundancy • Degrade gracefully on unmaskable failures • Test failures • Production diagnostics are key to failure detection and prevention
  • 50. 50 Try these links for more • Distributed Systems at OK.RU: slideshare.net/m0nstermind • https://v.ok.ru/publishing.html • Notes on Theory of Distributed Systems, CS 465/565, Spring 2014, James Aspnes: http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf