16 – 17 November, Sofia, ISTACON.ORG
Running in multiple data centers
By Nikolay Stoitsev
600+ cities
75+ countries
6 continents
2 000 000+ drivers
How the Internet Kept Humming During 2 Hurricanes
https://www.nytimes.com/2017/09/18/us/harvey-irma-internet.html
Fault tolerance
Low latency
Compliance
Data locality
Under-utilized capacity
CAP
Continuous network partition
2 types of architecture
Active-Passive
[Diagram: DC 1 and DC 2]
Failover
DNS
Stateless service
Stateful service
Active-Passive example
[Diagram, three builds: DC 1 and DC 2 with databases DB 1 and DB 2]
Real-life example
[Diagram: DC 1 and DC 2 with one master and three slaves]
Smart intermediary
[Diagram, three builds: DC 1 and DC 2 with one master, three slaves and two HAProxy instances; in the last build the master label appears on a different node]
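A minimal sketch of what a smart intermediary might do, under assumptions that are not in the slides: the node names, the `healthy` flag and the promotion rule (first healthy slave wins) are all hypothetical, and this is plain Python rather than HAProxy configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    dc: str
    role: str  # "master" or "slave"
    healthy: bool = True


@dataclass
class Intermediary:
    """Toy stand-in for a proxy that hides the database topology from clients."""
    nodes: list = field(default_factory=list)

    def master(self) -> Node:
        """Return the current master, promoting a slave if the master is down."""
        current = next(n for n in self.nodes if n.role == "master")
        if current.healthy:
            return current
        # Hypothetical promotion rule: the first healthy slave becomes master.
        candidate = next(n for n in self.nodes if n.role == "slave" and n.healthy)
        current.role, candidate.role = "slave", "master"
        return candidate

    def route(self, query: str) -> str:
        """Send writes to the master and reads to any healthy node."""
        verb = query.lstrip().split()[0].upper()
        if verb in {"INSERT", "UPDATE", "DELETE"}:
            return self.master().name
        return next(n for n in self.nodes if n.healthy).name


proxy = Intermediary([
    Node("db-a", "DC 1", "master"),
    Node("db-b", "DC 1", "slave"),
    Node("db-c", "DC 2", "slave"),
    Node("db-d", "DC 2", "slave"),
])
before = proxy.route("INSERT INTO trips VALUES (1)")
proxy.nodes[0].healthy = False  # the master fails
after = proxy.route("INSERT INTO trips VALUES (2)")
```

The client code issues the same call before and after the failure; only the intermediary knows that the master has changed.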
All-active
[Diagram: DC 1 and DC 2]
Locality
Split traffic in groups
Global State
Partitioning: user_id mod 2
= 0: DC 1
= 1: DC 2
Partitioning: user_id mod 3
= 0: DC 1
= 1: DC 2
= 2: DC 3
Very inefficient
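One way to see the inefficiency is to count how many users change data center when a third one is added to a modulo scheme. This is a small sketch; the sample of 6000 consecutive user ids is hypothetical, and data centers are indexed from 0 here, so index 0 corresponds to DC 1.

```python
def dc_for(user_id: int, dc_count: int) -> int:
    """Modulo partitioning: 0-based data center index for a user."""
    return user_id % dc_count


user_ids = range(6000)  # hypothetical sample of user ids
# A user moves when the mod 2 and mod 3 assignments differ.
moved = sum(1 for u in user_ids if dc_for(u, 2) != dc_for(u, 3))
moved_fraction = moved / len(user_ids)
```

In every run of six consecutive ids only two keep their data center, so two thirds of the users would have to be moved, although a new third data center only needs to take over about one third of them.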
Consistent hashing
[Diagram, two builds: hash ring with points labelled DC 1, DC 3, DC 2, DC 3; the second build places a user_id on the ring]
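A minimal consistent hashing ring, as a sketch only. The use of MD5, 100 virtual points per data center and the `user-N` key format are assumptions made for illustration; the slides do not say how the ring is built.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place several virtual points for the node on the ring."""
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node point."""
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["DC 1", "DC 2"])
users = [f"user-{i}" for i in range(10_000)]
before = {u: ring.lookup(u) for u in users}
ring.add("DC 3")
after = {u: ring.lookup(u) for u in users}
moved = [u for u in users if before[u] != after[u]]
```

Every user in `moved` lands on DC 3; nobody is shuffled between DC 1 and DC 2. The expected share of moved users is about one third, compared with two thirds for the modulo scheme.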
DNS load balancing
[Diagram, two builds: DC 1 and DC 2 serving San Francisco, Los Angeles, New York and Toronto]
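A sketch of the routing decision behind DNS load balancing. The city to data center mapping below is hypothetical (the slides show the cities and data centers but not which one serves which), and `resolve` is an illustrative function, not a real DNS API.

```python
# Hypothetical preferred data center per city.
PREFERRED = {
    "San Francisco": "DC 1",
    "Los Angeles": "DC 1",
    "New York": "DC 2",
    "Toronto": "DC 2",
}
ALL_DCS = ["DC 1", "DC 2"]


def resolve(city: str, healthy: set) -> str:
    """Return the data center a DNS answer would point this city at."""
    preferred = PREFERRED[city]
    if preferred in healthy:
        return preferred
    # Fall back to any remaining healthy data center.
    for dc in ALL_DCS:
        if dc in healthy:
            return dc
    raise RuntimeError("no healthy data center")


normal = {city: resolve(city, {"DC 1", "DC 2"}) for city in PREFERRED}
degraded = {city: resolve(city, {"DC 2"}) for city in PREFERRED}
```

With both data centers healthy each city goes to its preferred one; with DC 1 out, every city resolves to DC 2.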
Database layer
No generic solution
Galera Cluster
Synchronous multi-master database cluster
http://galeracluster.com/
[Diagram: DC 1 to DC 4, each hosting one master and one slave]
Apache Cassandra
http://cassandra.apache.org/
Linear scalability
Fault-tolerance
Commodity hardware
Designed for multiple data centers
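One concrete piece of Cassandra's multi data center design is the quorum arithmetic behind consistency levels such as QUORUM and LOCAL_QUORUM. The sketch below is pure arithmetic with no driver calls; the replication factors of 3 per data center are hypothetical.

```python
def quorum(replicas: int) -> int:
    """Majority of a replica set: floor(replicas / 2) + 1."""
    return replicas // 2 + 1


# Hypothetical replication factor per data center.
replication = {"DC 1": 3, "DC 2": 3}

# LOCAL_QUORUM: majority of the replicas in the coordinator's data center.
local_quorum = {dc: quorum(rf) for dc, rf in replication.items()}

# QUORUM: majority of all replicas across data centers.
global_quorum = quorum(sum(replication.values()))
```

With these numbers a LOCAL_QUORUM request waits for 2 replicas in the local data center, while QUORUM waits for 4 of the 6 replicas and therefore has to cross data centers.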
Apache Mesos
http://mesos.apache.org/
Application Layer
Apache Kafka
uReplicator
https://github.com/uber/uReplicator
https://eng.uber.com/ureplicator/
Cherami
https://github.com/uber/cherami-server
Multi-zone topics
[Diagram: two producers, two topics and two consumer groups, with replication between the topics]
Multi-zone consumers
[Diagram: one producer, two topics and two consumer groups, with replication and offset sync between them]
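A toy model of topic replication plus offset sync, to show why a consumer group can resume in another zone. This is not the Cherami or Kafka API; the `Zone` class, the zone names and the sync timing are all hypothetical.

```python
class Zone:
    """Toy zone holding a replicated copy of one topic and one group offset."""

    def __init__(self, name: str):
        self.name = name
        self.log = []    # messages in this zone's copy of the topic
        self.offset = 0  # next index the consumer group will read

    def consume(self, count: int) -> list:
        batch = self.log[self.offset:self.offset + count]
        self.offset += len(batch)
        return batch


def replicate(src: Zone, dst: Zone) -> None:
    """Copy messages the destination has not seen yet."""
    dst.log.extend(src.log[len(dst.log):])


def sync_offset(src: Zone, dst: Zone) -> None:
    """Ship the consumer group's position to the other zone."""
    dst.offset = src.offset


zone_a, zone_b = Zone("zone-a"), Zone("zone-b")
zone_a.log.extend(f"msg-{i}" for i in range(10))
replicate(zone_a, zone_b)

zone_a.consume(4)
sync_offset(zone_a, zone_b)  # last sync before the failure
zone_a.consume(2)            # consumed, but never synced

# zone-a is lost; the group resumes in zone-b from the last synced offset.
resumed = zone_b.consume(3)
```

In this model `resumed` starts at `msg-4`, so `msg-4` and `msg-5` are delivered a second time. Consumers in such a setup would need to tolerate duplicates after a failover.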
https://eng.uber.com/cherami/
Lessons learned
[Chart, two builds: time thinking about failover relative to total dev time]
Failover testing
Failure testing
Super smart clients
“The best way to avoid failure is to fail constantly.”
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
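Failure testing can start very small. The sketch below injects a failure into a test double and checks that the caller fails over; `FlakyDC` and `call_with_failover` are hypothetical names made up for this example, not part of any tool mentioned in the talk.

```python
class FlakyDC:
    """Test double for a data center that fails on chosen call numbers."""

    def __init__(self, name: str, fail_on=()):
        self.name = name
        self.fail_on = set(fail_on)
        self.calls = 0

    def handle(self, request: str) -> str:
        self.calls += 1
        if self.calls in self.fail_on:
            raise ConnectionError(f"{self.name} injected failure")
        return f"{self.name} handled {request}"


def call_with_failover(request: str, dcs: list) -> str:
    """Try each data center in order until one answers."""
    last_error = None
    for dc in dcs:
        try:
            return dc.handle(request)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError("all data centers failed") from last_error


primary = FlakyDC("DC 1", fail_on={2})  # the second call to DC 1 fails
secondary = FlakyDC("DC 2")
results = [call_with_failover(f"req-{i}", [primary, secondary]) for i in range(3)]
```

The second request is served by DC 2 and the third goes back to DC 1, which is the behaviour a failover test would assert on.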
Thank you!
@stoitsev
Nikolay Stoitsev
http://careersinfo.uber.com/sofia-engineering
