Distributed systems at ok.ru #rigadevday

Distributed Systems @ OK.RU
Oleg Anastasyev
@m0nstermind
oa@ok.ru

1. Absolutely reliable network
2. with negligible Latency
3. and practically unlimited Bandwidth
4. It is homogenous
5. Nobody can break into our LAN
6. Topology changes are unnoticeable
7. All managed by single genius admin
8. So data transport cost is zero now
2
OK.ru has come to:

1. Absolutely reliable network
2. with negligible Latency
3. and practically unlimited Bandwidth
4. It is homogenous (same HW and hop cnt to every server)
5. Nobody can break into our LAN
6. Topology changes are unnoticeable
7. All managed by single genius admin
8. So data transport cost is zero now
3
Fallacies of distributed computing
https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
[Peter Deutsch, 1994; James Gosling 1997]

4
4
Datacenters
150
distinct
microservices
8000
iron servers
OK.RU has come to:

5
hardware
engineers
network
engineers
operations
developers

6
My friends page
1. Retrieve friends ids
2. Filter by friendship type
3. Apply black list
4. Resolve ids to profiles
5. Sort profiles
6. Retrieve stickers
7. Calculate summaries

7
The Simple WayTM
SELECT * FROM friendlist, users  
WHERE userId=? AND f.kind=? AND u.name LIKE ?
AND NOT EXISTS( SELECT * FROM blacklist …)
…

• Friendships
• 12 billions of edges, 300GB
• 500 000 requests per sec
8
Simple ways don't work
• User profiles
• > 350 millions,
• 3 500 000 requests/sec, 50 Gbit/sec

9
How stuff works
web frontend API frontend
app server
one-graph user-cache black-list
microservices

10
Micro-service dissected
Remote interface
Business logic, caches
[ Local storage ]
1 JVM

11
Micro-service dissected
Remote interface
https://github.com/odnoklassniki/one-nio
interface GraphService extends RemoteService {
@RemoteMethod
long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
}
interface UserCache { 
@RemoteMethod
User getUserById(long id);
}

12
App Server code
long []friendsIds = graphService.getFriendsByFilter(userId, mask);
List<User> users = new ArrayList<Long>(friendsIds.length);
for (long id : friendsIds) {
if(blackList.isAllowed(userId,id)) {
users.add(userCache.getUserById(id));
}
}
…
return users;

• Partition by this parameter value
• Using partitioning strategy
• long id -> int partitionId(id) -> node1,node2,…
• Strategies can be different
• Cassandra ring, Voldemort partitions
• or …
13
interface GraphService extends RemoteService {
@RemoteMethod
long[] getFriendsByFilter(@Partition long vertexId, long relationMask);
}

14
Weighted quadrant
p = id % 16
p = 0
p = 15
p = 1
N01 N02 N03 . . . 019 020
W=1
W=100
N11
node = wrr(p)
SET

15
A coding issue
List<User> users = new ArrayList<Long>(friendsIds.length);
for (long id : friendsIds) {
if(blackList.isAllowed(userId,id)) {
users.add(userCache.getUserById(id));
}
}
…
return users;

16
latency  
= 1.0ms * 2 reqs * 200 friends 
= 400 ms 
A roundtrip price
0.1-0.3 ms
0.7-1.0 ms
remote datacenter
* this price is tightly coupled with the specific infrastructure and frameworks
10k friends latency = 20 seconds

17
Batch requests to the rescue
public interface UserCache { 
@RemoteMethod( split = true )
Collection<User> getUsersByIds(long[] keys);
}
 
friendsIds = blackList.filterAllowed(userId, friendsIds );
List<User> users = userCache.getUsersByIds(friendsIds);
…
return users;

18
split & merge
split ( ids by p )
-> ids0, ids1
p = 0
p = 1
N01 N02 N03 . . .
N11
ids0
ids1
users = merge (users0, users1)

19
1. Client crash
2. Server crash
3. Request omission
4. Response omission
5. Server timeout
6. Invalid value response
7. Arbitrary failure
What could possibly fail ?

Failures
Distributed systems at OK.RU

• We can not prevent failures - only mask them
• If a Failure can occur it will occur
• Redundancy is a must to mask failures
• Information ( error correction codes )
• Hardware (replicas, substitute hardware)
• Time (transactions, retries)
21
What to do with failures ?

22
What happened to transaction ?
Don’t give up!
Must retry !
Must give up!  
Don't retry !
? ?
Add Friend

• Client does not really know
• What client can do ?
• Don’t make any guarantees.
• Never retry. At Most Once.
• Always retry. At Least Once.
23
Was friendship succeeded ?

1. Transaction in ACID database
• single master, success is atomic (either yes or no)
• atomic rollback is possible
2. Cache cluster refresh
• many replicas, no master
• no rollback, partial failures are possible
24
Making new friendship

• Operation can be reapplied multiple times with same result
• e.g.: read, Set.add(), Math.max(x,y)
• Atomic change with order and dup control 
25
Idempotence
“Always retry” policy can be applied 
only on 
Idempotent Operations
https://en.wikipedia.org/wiki/Idempotence

26
Idempotence in ACID database
Make friends
wait; timeout
Make friends (retry)
Friendship, peace and bubble gum !
Already friends ?
No, let’s make it !
Already friends ?
Yes, NOP !

27
Sequencing
MakeFriends (OpId)
Made friends!
Is Dup (OpId) ?
No, making changes
OpId := Generate()
Generate() examples:
• OpId+=1
• OpId=currentTimeMillis()
• OpId=TimeUUID
http://johannburkard.de/software/uuid/

1. Transaction in ACID database
• single master, success is atomic (either yes or no)
• atomic rollback is possible
2. Cache cluster refresh
• many replicas, no master
• no rollback, partial failures are possible
28
Making new friendship

29
Cache cluster refresh
add(Friend)
p = 0 N01 N02 N03 . . .
But replicas state will diverge otherwise
Retries are meaningless

• Background data sync process
• Reads updated records from ACID store 
 
SELECT * FROM users WHERE modified > ?
• Applies them into its memory
• Loads updates on node startup
• Retry can be omitted then 
30
Syncing cache from DB

31
Death by timeout
GC
Make Friends
wait; timeout
thread pool  
exhausted

1. Clients stop sending requests to server
After X continuous failures for the last second
2. Clients monitor server availability
In background, once a minute
3. And turn it back on
32
Server cut-off

33
Death by slowing down
Avg = 1.5ms
Max = 1.5c
24 cpu cores
Cap = 24,000 ops
Choose 2.4ms timeout ?
Cut it off from client if latency avg > 2.4ms ?
Avg = 24ms
Max = 1.5s
24 cpu cores
Cap = 1,000 ops
10,000 ops

34
Speculative retry
Idemponent Op
wait; timeout
Retry
Result Response

• Makes requests to replicas before timeout
• Better 99%, even average latencies
• More stable system
• Not always applicable:
• Idempotent ops, additional load, traffic (to consider)
• Can be balanced: always, >avg, >99p
35
Speculative retry

More failures !
Distributed systems @ OK.RU

• Excessive load
• Excessive paranoia
• Bugs
• Human error
• Massive outages
37
All replicas failure

38
Use of non-authoritative datasources,
degrade consistency
Use of incomplete data in UI,
partial feature degradation 
Single feature full degradation
Degrade (gracefully) !

39
The code
interface UserCache { 
@RemoteMethod
Distributed<Collection<User>> getUsersByIds(long[] keys);
}
interface Distributed<D>
{
boolean isInconsistency();
D getData();
}
class UserCacheStub implements UserCache {
 
Distributed<Collection<User>> getUsersByIds(long[] keys) {
return Distributed.inconsistent();
}
}

Resilience testing

41
The product you make
Operations in production env
What to test for failure ?
“Standard” products - with special care !

• What is does:
• Detects network connections between servers
• Disables them (iptables drop)
• Runs auto tests
• What we check
• No crashes, nice UI messages are rendered
• Server does start and can serve requests
42
The product we make : “Guerrilla”

Production diagnostics

• To know an accident exists. Fast.
• To track down to the source of accident. Fast.
• To prevent accidents before they happen.
44
Why

• Zabbix
• Cacti
• Operational metrics
• Names od operations, e.g. “Graph.getFriendsByFilter”
• Call count, their success or failure
• Latency of calls
45
Is (will) there be accident ?

• Current metrics and trends
• Aggregated call and failure counts
• Aggregated latencies
• Average, Max
• Percentiles 50,75,98,99,99.9
46
What charts show to us

• The possibilities for failure in distributed systems are endless
• Don't “prevent”, but mask failures through redundancy
• Degrade gracefully on unmask-able failure
• Test failures
• Production diagnostics are key to failure detection and prevention
49
Short summary

50 Distributed Systems at OK.RU
slideshare.net/m0nstermind
https://v.ok.ru/publishing.html
http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf
Notes on Theory of Distributed Systems CS 465/565:  
Spring 2014
James Aspnes
Try these links for more

Distributed systems at ok.ru #rigadevday

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (12)

Similar to Distributed systems at ok.ru #rigadevday

Similar to Distributed systems at ok.ru #rigadevday (20)

More from odnoklassniki.ru

More from odnoklassniki.ru (8)

Recently uploaded

Recently uploaded (20)

Distributed systems at ok.ru #rigadevday