Distributed Systems: scalability and high availability

Distributed Systems

scalability and high availability

Renato Lucindo - lucindo.github.com - @rlucindo

Renato Lucindo

Call me Lucindo (or Linus)
2002 - Bachelor Computer Science
2007 - M.Sc. Computer Science (Combinatorial
Optimization)
7+ year developing Distributed Systems

My default answer: "I don't know."

Agenda

Scalability

High Availability

Problems

Tips and Tricks

Learning More

Distributed Systems

Multiple computers that interact with each other over a
network to achieve a common goal
Purpose
Scalability
High availability

source: http://www.cnds.jhu.edu/

Scalability

System ability to handle gracefully a growing amount of
work

Scale up (vertical)
Add resources to a single node
Improve existing code to handle more work

Scale out (horizontal)
Add more nodes to a system
Linear (or better) scalability

Scalability - Vertical

Add: CPU, Memory, Disks (bigger box)
Handling more simultaneous:
Connections
Operations
Users
Choose a good I/O and concurrency model
Non-blocking I/O
Asynchronous I/O
Threads (single, pool, per-connection)
Event handling patterns (Reactor, Proactor, ...)
Memory model?
STM

Scalability - Vertical

Careful with numbers
Requests per second
# of Connections
Simultaneous operations
Event handling
Think front-end
Slow connections/clients
It's slower than other options
In doubt, go async
Back-end
Thread pool (thread per-connection)
No events
Process per-core

Scalability - Horizontal

Add nodes to handle more work
Front-end
Straightforward
Stateless
Back-end
Master/Slave(s)
Partitioning
DHT
Volatile Index


Master/Slave
Write on single Master
Read on Slaves (one or more)
Scales reads


Partitioning (Sharding)
Distribute dada across nodes
Generally involves data de-normalization
Where is some specific data?
Master Index
Hash (DTH, Consistent Hashing)
Volatile Index
Joins done in application level
NoSQL friendly


Volatile Index: build and maintain data index as cached
information (all clients)

High Availability

"Processes, as well as people, die"

Handle hardware and software failures
Eliminate single point of failure
Redundancy
Failover
Replicas

High Availability - Failover/Redundancy

High Availability - Replicas

Two or more copies of same data
Replica granularity
From node replica to "row" replica
Load balancing
Write concurrency
Replica updates
Key for high availability and root of several problems

Problems - CAP Theorem

Consistency: all operations (reads/writes) yield a global
consistent state

Availability: all requests (on non-failed servers) must have
a response

Partition Tolerance: nodes may not be able
to communicate with each other.

Pick Two


C + A: network problems might stop the system

Examples:
Oracle RAC, IBM DB2 Parallel
RDBMS (Master/Slave)
Google File System
HDFS (Hadoop)


C + P: clients can't always perform operations

Examples:
Distributed lock-systems: Chubby, ZooKeeper
Paxos protocol (consensus)
BigTable, Hbase
Hypertable
MongoDB


A + P: clients may read inconsistent (old or undone) data

Examples:
Amazon Dynamo
Cassandra
Voldemort
CouchDB
Riak
Caches

Problem with CAP Theorem

In practice, C + A and C + P systems are the same.
C + A: not tolerant of network partitions
C + P: not available when a network partition occurs
Big problem: network partition
Not so big (how often does it happens?)
Pick two
Availability
Consistency
The forgotten: Latency
Or, how long the system waits before considering a
partitioned network?

Problems - Real World

Every component may fail:
Network failure
Hardware failure
Electricity
Natural disasters
Code failure

Tips & Tricks - Pyramid

Capacity (connections, operations, ...) Pyramid

Tips & Tricks - Reply Fast

FAIL Fast
Break complex requests into smaller ones
Use timeouts
No transactions
Be aware that a single slow operation or component can
generate contention
Self-denial attack

Tips & Tricks - Cache

Cache: component location, data, dns lookups, previous
requests, etc
Use negative cache for failed requests (low expiration)
Don't rely on cache
Your system must work with no cache

Tips & Tricks - Queues

Easy way to add asynchronous processing an decouple
your system.

Tips & Tricks - Logs

Log everything
Use several log levels
On every log message
User
Request host
Component involved
Version
Filename and line
If log level not enabled do not process log message
Avoid lookup calls (gettimeofday)

Tips & Tricks - Domino Effect

Make sure your load balancer won't overload components
User smart algorithms
Load Balance
Resource Allocation

Tips & Tricks - (Zero) Configuration

No configuration files
Use good defaults
Auto-discovery (multicast, gossip, ...)
Make everything configurable
Administrative command
No need to stop for changes
Automatic self adjusts when possible

Tips & Tricks - STOP Test

With your system under load: kill -STOP <component>

Tips & Tricks - Know your tools

load average (uptime)
stats tools
vmstat
iostat
mpstat
tcpstat, tcprstat, etc
tcpdump, nc, netstat
tunning
/proc/net/*
ulimit
sysctl
oprofile
debuging tools (gdb, valgrind)
...

Tips & Tricks - Count

Count everything
Connections
Operations
Failures
Successes
Request times (granularity)
Total, average, standard deviation
Monitor counters

Tips & Tricks - Stability Patterns

Use Timeouts
Circuit Breaker
Bulkheads
Steady State
Fail Fast
Handshaking
Test Harness
Decoupling Middleware

Learning More - Books

TCP/IP Illustrated, Vol. 1: The Protocols


Unix Network Programming, Vol. 1: The Sockets Networking


Pattern Oriented Software Architecture, Vol. 2


Release It!

Learning More - Papers

The Google File System
Bigtable: A Distributed Storage System for Structured Data
Dynamo: Amazon's Highly Available Key-Value Store
PNUTS: Yahoo!’s Hosted Data Serving Platform
MapReduce: Simplified Data Processing on Large Clusters

Towards robust distributed systems
Brewer's conjecture and the feasibility of consistent,
available, partition-tolerant web services
BASE: An Acid Alternative
Looking up data in P2P systems

Thanks!!! Questions?

lucindo.github.com - @rlucindo

Distributed Systems: scalability and high availability

More Related Content

What's hot

Similar to Distributed Systems: scalability and high availability

Recently uploaded

Distributed Systems: scalability and high availability