
Reversim 2017 cross region data replication design considerations



Different requirements (high availability, data residency) and high-level designs for AWS cross-region data replication (S3 vs. DynamoDB vs. Kinesis vs. Couchbase vs. Cassandra). This talk focuses on requirements, data consistency, and write conflicts (with a CRDT example). It is a "theoretical" talk in the sense that no Forter-specific design is presented; it should guide architects who want to design their service with "cross-region" in mind.

Published in: Technology


  1. Cross Region Data Replication Design Considerations
     Itai Friendinger, itai@forter.com
  2. Our financial institutions remain strong, and the American economy will be open for business as well.
  3. Decision as a Service Example
     if isFraud(tx.address,tx.payment) { return DECLINE; } else { return APPROVE; }
     [Diagram labels: TX; Fraud Decision (100ms); Decision]
  4. Decision as a Service Example
     [Diagram labels: Event Processor (1000ms); Change Account Address; Change Account Payment; Unified People Store; partial update; read; TX; Fraud Decision (100ms); Decision]
  5. Design "It'll be fine"
     ● No Cross Region Replication
     [Diagram labels: TX; Fraud Decision; Decision; Event Processor; People Store; raw event]
  6. Design "It's on me"
     ● Cron Sync every 3 hours
     ● Replication != Reconciliation
     ● Replication != Backup
     [Diagram labels, shown once per region: TX; Fraud Decision; Decision; Event Processor; People Store; raw event. Between the regions: Cron Sync]
  7. Design "Yalla, fine"
     ● Read-Only RDS Replica
     ● Proxying data into a single Data Center
     ● Requires quarterly failover drills
     ● Cannot withstand a real disaster for long
     [Diagram labels: TX; Fraud Decision; Event Forwarder; People Store; TX; Fraud Decision; Decision; Event Processor; People Store; raw event; RDS Replication; raw event; TX Forwarding; Decision]
  8. Design "Two for the price of one"
     ● CloudEndure DRaaS
     ● Point In Time Recovery
     ● Requires quarterly failover drills
     ● For existing apps (Enterprises)
     [Diagram labels: People Store; TX; Fraud Decision; Decision; Event Processor; People Store; raw event; Block Device Replication]
  9. Design "Just you wait"
     ● Google Cloud Spanner is here; geo-distributed transactions are coming
     ● For green-field apps (Startups)
     [Diagram labels, shown once per region: TX; Fraud Decision; Decision; Event Processor; People Store; raw event. Between the regions: Transactions]
  10. Design "Count on it"
      ● Out-Of-The-Box Real-Time Bi-Directional Data-Center Aware Replication
      ● Write Conflict resolution
      [Diagram labels, shown once per region: TX; Fraud Decision; Decision; Event Processor; People Store; raw event. Between the regions: 2Way Replication]
  11. Design "Her sister"
      ● Replication of Raw Events
      ● State Divergence
      [Diagram labels, shown once per region: TX; Fraud Decision; Decision; Event Processor; People Store; raw event. Between the regions: 2way Replication]
  12. Read Consistency Guarantees
      Loosely based on Consistency Explained Through Baseball by Doug Terry

      time   partial update   state
      15m    Hapoel = 1       1:0
      32m    Maccabi = 1      1:1
      89m    Hapoel = 2       2:1
      91m    Maccabi = 2      2:2

      ● Strong ⇒ 2:2
        ○ See all previous writes
      ● Read own Writes
        ○ See all writes performed by reader
      ● Monotonic ⇒ 2:1
        ○ See all writes from the beginning until N seconds ago
      ● Eventual ⇒ 1:2
        ○ See the writes in a different order (some still missing)
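The scoreboard on this slide can be replayed as a sequence of partial updates. The following is a minimal sketch, not tied to any database: the class name `ScoreReplay` and the method `replay` are hypothetical. It applies a chosen subset of the four writes, in a chosen order, to an empty scoreboard and reports the score a reader would see.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: replay the slide's partial updates to see which
// score a read could return. Not tied to any database API.
final class ScoreReplay {
    // Each write is {team, goals}, in the order the slide lists them.
    static final List<String[]> WRITES = List.of(
        new String[] {"Hapoel", "1"},   // 15m
        new String[] {"Maccabi", "1"},  // 32m
        new String[] {"Hapoel", "2"},   // 89m
        new String[] {"Maccabi", "2"}); // 91m

    // Applies the writes at the given indexes, in the given order.
    static String replay(int... indexes) {
        Map<String, Integer> state = new LinkedHashMap<>();
        state.put("Hapoel", 0);
        state.put("Maccabi", 0);
        for (int i : indexes) {
            String[] w = WRITES.get(i);
            state.put(w[0], Integer.parseInt(w[1]));
        }
        return state.get("Hapoel") + ":" + state.get("Maccabi");
    }

    public static void main(String[] args) {
        System.out.println("all writes:         " + replay(0, 1, 2, 3)); // 2:2
        System.out.println("prefix of writes:   " + replay(0, 1, 2));    // 2:1
        System.out.println("reordered, missing: " + replay(3, 0));       // 1:2
    }
}
```

The last call shows why 1:2 is surprising: it is a score that never existed during the match, yet a reader can observe it when writes arrive out of order and some are still missing.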
  13. Hello Couchbase
      ● read-mutate-write of entire state
      ● Client reaches cluster's primary node
      ● Conflict Prevention: CAS
      ● Optimizations: subdocument API
      [Diagram labels: node us-west-2b; node us-west-2c; Event Processor (read/m/write), Strong; TX Decision (read), Strong]
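Conflict prevention with CAS amounts to an optimistic read-mutate-write loop: write back only if the document has not changed since it was read, otherwise retry. The sketch below uses an in-memory map as a stand-in for the cluster. It is not the Couchbase SDK, and `CasStore`, `Versioned`, `replaceIfUnchanged` and `update` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// In-memory stand-in for a document store with compare-and-swap (CAS).
// This is NOT the Couchbase SDK; all names are hypothetical.
final class CasStore {
    // A document together with the CAS value it was stored under.
    static final class Versioned {
        final long cas;
        final String body;
        Versioned(long cas, String body) { this.cas = cas; this.body = body; }
    }

    private final Map<String, Versioned> docs = new ConcurrentHashMap<>();

    Versioned get(String key) { return docs.get(key); }

    // Succeeds only if nobody replaced the document since it was read.
    boolean replaceIfUnchanged(String key, Versioned read, String newBody) {
        Versioned next = new Versioned(read == null ? 1 : read.cas + 1, newBody);
        if (read == null) return docs.putIfAbsent(key, next) == null;
        // Versioned does not override equals(), so this compares by identity.
        return docs.replace(key, read, next);
    }

    // Read-mutate-write of the entire state; retries after a lost race.
    String update(String key, UnaryOperator<String> mutate) {
        while (true) {
            Versioned read = get(key);
            String newBody = mutate.apply(read == null ? "" : read.body);
            if (replaceIfUnchanged(key, read, newBody)) return newBody;
        }
    }
}
```

A writer holding a stale read is rejected rather than silently overwriting the newer document, which is what distinguishes conflict prevention from conflict resolution.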
  14. Hello Couchbase
      ● XDCR replicates entire state between clusters
      ● Optimizations: dedup by key, metadata first
      [Diagram labels: node us-west-2b; node us-west-2c; Event Processor (read/m/write), Strong; TX Decision (read), Strong; XDCR; node us-east-1c; node us-east-1b; TX Decision (read), Monotonic]
  15. Couchbase Last Write Wins
      Design "Count on it"
      ● Conflict Resolution: LWW erases losing side
      ● Remember: NTP, no "sudo date"
      ● Document Version = read-own-writes
      ● Conflict Resolution: 48bit timestamp
      ● Conflict Prevention: 16bit CAS
      [Diagram labels: node us-west-2b; node us-west-2c; Event Processor (read/m/write); XDCR; node us-east-1c; node us-east-1b; Event Processor (read/m/write); TX Decision (read), Monotonic, read-own-writes; TX Decision (read), Monotonic, read-own-writes]
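Last-write-wins can be sketched as a register that keeps whichever side carries the newer timestamp. This is an illustration of the idea only, not the XDCR implementation: `LwwRegister` is a hypothetical name, the timestamp is a plain wall-clock value, and the tie-break on value is an assumption added so that both sides choose the same winner.

```java
// Hypothetical last-write-wins register; not the XDCR implementation.
final class LwwRegister {
    final String value;
    final long timestampMillis; // wall-clock time of the write

    LwwRegister(String value, long timestampMillis) {
        this.value = value;
        this.timestampMillis = timestampMillis;
    }

    // Keeps the side with the newer timestamp; the losing write is erased.
    // Assumption: ties are broken by comparing values, so both clusters
    // reach the same result regardless of merge direction.
    static LwwRegister merge(LwwRegister a, LwwRegister b) {
        if (a.timestampMillis != b.timestampMillis) {
            return a.timestampMillis > b.timestampMillis ? a : b;
        }
        return a.value.compareTo(b.value) >= 0 ? a : b;
    }

    public static void main(String[] args) {
        // The east clock runs one minute behind, so its later write loses.
        LwwRegister west = new LwwRegister("address=old", 120_000L);
        LwwRegister east = new LwwRegister("address=new", 90_000L);
        System.out.println(merge(west, east).value); // address=old
    }
}
```

The `main` method shows why the slide insists on NTP: with skewed clocks, the write that happened later in real time can be the one that is discarded.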
  16. Hello Cassandra
      ● Client reaches closest node, blocks until LOCAL_QUORUM
      ● No Conflict Prevention ⇒ Use partial updates or inserts
      [Diagram labels: node us-west-2a; node us-west-2b; node us-west-2c; Event Processor (partial update); node us-east-1a; node us-east-1b; node us-east-1c; TX Decision (read), Strong (?); TX Decision (read)]
  17. Cassandra Last Write Wins per Column
      Design "Count on it"
      Two clients update payment and address of the same person with exactly the same client timestamps.
      ● update payment wins (?)
      ● update address wins (?)
      [Diagram labels: node us-west-2a; node us-west-2b; node us-west-2c; Event Processor (partial update); node us-east-1a; node us-east-1b; node us-east-1c; Event Processor (partial update); TX Decision (read); TX Decision (read)]
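Applying last-write-wins per column rather than per row means two partial updates that touch different columns both survive a merge, even with identical timestamps. The sketch below models a row as a map of column name to timestamped cell. `LwwRow` and `Cell` are hypothetical names, no Cassandra driver is involved, and the tie-break on value is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-column last-write-wins merge; no Cassandra driver involved.
final class LwwRow {
    // One column value with the client timestamp of its write.
    static final class Cell {
        final String value;
        final long ts;
        Cell(String value, long ts) { this.value = value; this.ts = ts; }
    }

    // Picks the newer cell. Assumption: equal timestamps are resolved by
    // comparing values, so every replica picks the same winner.
    static Cell newer(Cell x, Cell y) {
        if (x.ts != y.ts) return x.ts > y.ts ? x : y;
        return x.value.compareTo(y.value) >= 0 ? x : y;
    }

    // Merges two versions of a row column by column.
    static Map<String, Cell> merge(Map<String, Cell> a, Map<String, Cell> b) {
        Map<String, Cell> out = new HashMap<>(a);
        for (Map.Entry<String, Cell> e : b.entrySet()) {
            out.merge(e.getKey(), e.getValue(), LwwRow::newer);
        }
        return out;
    }
}
```

Only when both clients write the same column does one of the writes get erased, which is why the previous slide recommends partial updates or inserts.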
  18. Cassandra Multi Value per Column
      Design "Count on it"
      ● Update different columns of same person
      ● Conflict resolution in TX Decision (on read)
      ● update payment1, address1 (?)
      ● update payment2, address2 (?)
      [Diagram labels: node us-west-2a; node us-west-2b; node us-west-2c; Event Processor (partial update); node us-east-1a; node us-east-1b; node us-east-1c; Event Processor (partial update); TX Decision (read); TX Decision (read)]
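Keeping multiple values per column defers the conflict to the reader. A simplified sketch, assuming hypothetical names `MultiValueColumn`, `merge` and `anyRisky`: concurrent values are unioned on replication, and the decision service applies its own policy at read time. A full MV-Register would also track causality to discard overwritten values; that part is omitted here.

```java
import java.util.Set;
import java.util.TreeSet;

// Simplified multi-value column: concurrent writes are all kept and the
// reader decides. Causality tracking (for example version vectors) is omitted.
final class MultiValueColumn {
    // Replication merge: keep every value seen on either side.
    static Set<String> merge(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    // Hypothetical read-time policy: be conservative and flag the person
    // if any surviving value appears on the risky list.
    static boolean anyRisky(Set<String> values, Set<String> risky) {
        for (String v : values) {
            if (risky.contains(v)) return true;
        }
        return false;
    }
}
```

The trade-off against last-write-wins is that nothing is erased, at the cost of pushing resolution logic into every reader.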
  19. Kafka
      Design "His brother"
      ● Conflict resolution in Event Processor
      ● Will both regions converge into the same state?
      [Diagram labels: Event Source (insert); Kafka us-west-2; Event Processor; S3 versioned us-west-2; TX Decision (read); mirror(s) us-west; mirror(s) us-east; inserts; Event Source (insert); Kafka us-east-1; Event Processor; S3 versioned us-east-1; TX Decision (read)]
  20. Converging events into state
      ● Duplicate events
        ○ Idempotent: compare-and-set(x, 2, 5)
        ○ De-duplication: 2 +3 +3 = 5 (the duplicate +3 is dropped)
        ○ Rollback
      ● Unordered events
        ○ Commutative: 2+3 = 3+2
        ○ Reordering window (requires state)
      ● Bulk/Parallel event processing
        ○ Associative: (2+3)+4 = 2+(3+4)
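The duplicate-event bullets above can be made concrete with the slide's own numbers. The following is a minimal sketch with hypothetical names (`EventConvergence`, `Deduper`): a plain addition double-counts a repeated event, a compare-and-set ignores it, and a de-duplicating processor remembers event ids.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of handling duplicate events.
final class EventConvergence {
    // Not idempotent: a duplicated "+3" event is counted twice.
    static int add(int state, int delta) { return state + delta; }

    // Idempotent: compare-and-set(x, 2, 5) has no effect the second time,
    // because the state is no longer 2.
    static int compareAndSet(int state, int expected, int update) {
        return state == expected ? update : state;
    }

    // De-duplication: remember applied event ids and skip repeats.
    // This requires the processor to keep extra state (the seen ids).
    static final class Deduper {
        private final Set<String> seen = new HashSet<>();
        int state;

        Deduper(int initial) { state = initial; }

        void apply(String eventId, int delta) {
            if (seen.add(eventId)) state += delta;
        }
    }

    public static void main(String[] args) {
        System.out.println(add(add(2, 3), 3));                           // 8
        System.out.println(compareAndSet(compareAndSet(2, 2, 5), 2, 5)); // 5
        Deduper d = new Deduper(2);
        d.apply("e1", 3);
        d.apply("e1", 3); // duplicate delivery, ignored
        System.out.println(d.state);                                     // 5
    }
}
```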
  21. Kafka Streams API - zooming in
      Design "His brother"
      [Diagram labels: Event Source (insert); Kafka us-west-2; Event Processor; S3 versioned us-west-2; TX Decision (read); mirror(s) us-west; mirror(s) us-east; inserts; Event Source (insert); Kafka us-east-1; Event Processor; S3 versioned us-east-1; TX Decision (read)]
  22. Kafka Streams API
      Design "Count on it"
      builder.stream("kstream1","kstream2") .filter(predicate) .transform(processor) .to("ktable")
      [Diagram labels: Event Source (insert); Kafka MirrorMaker (?); kstream1; kstream2; Kafka Stream API; ktable; Kafka S3 Connector; S3]
  23. Kafka Processor API and Local Store
      Design "Count on it"
      Map process(Map event) {
        Map state = kvStore.get(event.key);
        state.putAll(event); // not commutative (order matters)
        kvStore.put(event.key, state);
        return state;
      }
      [Diagram labels: Event Source (insert); Kafka MirrorMaker (?); kstream1; kstream2; Kafka Stream API; ktable; Kafka S3 Connector; S3]
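The "not commutative" comment can be checked with plain `java.util` maps and no Kafka dependency. A minimal sketch, assuming a hypothetical helper `PutAllOrder.apply` that mirrors the `putAll` step of the slide's `process()`: two events that set the same key produce different states depending on arrival order.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of why a putAll-based merge is order dependent.
final class PutAllOrder {
    // Same merge step as the slide's process(): later events overwrite
    // earlier values for the same key.
    static Map<String, String> apply(Map<String, String> state, Map<String, String> event) {
        Map<String, String> next = new HashMap<>(state);
        next.putAll(event);
        return next;
    }

    public static void main(String[] args) {
        Map<String, String> empty = Map.of();
        Map<String, String> e1 = Map.of("address", "A1");
        Map<String, String> e2 = Map.of("address", "A2");
        System.out.println(apply(apply(empty, e1), e2)); // {address=A2}
        System.out.println(apply(apply(empty, e2), e1)); // {address=A1}
    }
}
```

If the two regions receive the mirrored events in different orders, their local stores diverge, which is what motivates the CRDT slides that follow.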
  24. CRDT Graph Model
      Conflict-free Replicated Data Type: Idempotent, Commutative, Associative
      ● Insert Only Graph
      ● Address / Payment / Person Objects
  25. G-Set: Growing Set CRDT
      Conflict-free Replicated Data Type: Idempotent, Commutative, Associative
      [Diagram: elements A, B; columns us-west-2, event, us-east-1]
      States shown, in slide order: {A,B}; {A,B}
  26. G-Set: Growing Set CRDT
      Conflict resolution method: merge sets
      [Diagram: elements A, C, B; columns us-west-2, event, us-east-1]
      States shown, in slide order: {A,B}; {A,B}; {A,C}; {A,B,C}
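A G-Set resolves conflicts by set union, which is idempotent, commutative and associative. The following is a minimal sketch with a hypothetical class name `GSet`, reproducing the slide's {A,B} and {A,C} example.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical grow-only set: elements can be added but never removed.
final class GSet {
    private final Set<String> items = new TreeSet<>();

    void add(String item) { items.add(item); }

    boolean contains(String item) { return items.contains(item); }

    // Merge is set union, so replicas converge regardless of how often
    // or in which order they exchange state.
    GSet merge(GSet other) {
        GSet out = new GSet();
        out.items.addAll(items);
        out.items.addAll(other.items);
        return out;
    }

    // Returns a copy so callers cannot mutate the replica's state.
    Set<String> value() { return new TreeSet<>(items); }

    public static void main(String[] args) {
        GSet west = new GSet();
        west.add("A");
        west.add("B");
        GSet east = new GSet();
        east.add("A");
        east.add("C");
        System.out.println(west.merge(east).value()); // [A, B, C]
        System.out.println(east.merge(west).value()); // [A, B, C]
    }
}
```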
  27. 2P-Set: Two Phase Set CRDT
      Composed of two G-Sets (added and tombstone)
      [Diagram: elements A, B; columns us-west-2, event, us-east-1]
      States shown, in slide order:
        add: {A,B} rmv: {A}
        add: {A,B} rmv: {A}
  28. 2P-Set: Two Phase Set CRDT
      Always grows. Garbage Collection algorithms exist.
      [Diagram: elements A, C, B; columns us-west-2, event, us-east-1]
      States shown, in slide order:
        add: {A,B} rmv: {A}
        add: {A,B} rmv: {A}
        add: {A,C} rmv: {B,D}
        add: {A,B,C} rmv: {A,B,D}
  29. 2P-Set: Two Phase Set CRDT
      Always grows. Garbage Collection algorithms exist.
      [Diagram: elements D, A, C, B; columns us-west-2, event, us-east-1]
      States shown, in slide order:
        add: {A,B} rmv: {A}
        add: {A,B} rmv: {A}
        add: {A,C} rmv: {B,D}
        add: {A,B,C} rmv: {A,B,D}
        add: {D}
        add: {A,B,C,D} rmv: {A,B,D}
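A 2P-Set can be sketched as two grow-only sets, one for additions and one for tombstones, merged independently by union. The class name `TwoPhaseSet` is hypothetical, and the sketch accepts a tombstone for an element that was never added locally, as in the slide's `rmv: {B,D}` event.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical two-phase set: an element is visible if it was added and
// never removed. Tombstones are permanent, so the state always grows.
final class TwoPhaseSet {
    final Set<String> added = new TreeSet<>();
    final Set<String> removed = new TreeSet<>(); // tombstones

    void add(String x) { added.add(x); }

    void remove(String x) { removed.add(x); }

    boolean contains(String x) { return added.contains(x) && !removed.contains(x); }

    // Merge both G-Sets by union.
    TwoPhaseSet merge(TwoPhaseSet other) {
        TwoPhaseSet out = new TwoPhaseSet();
        out.added.addAll(added);
        out.added.addAll(other.added);
        out.removed.addAll(removed);
        out.removed.addAll(other.removed);
        return out;
    }

    // The visible elements: added minus tombstones.
    Set<String> value() {
        Set<String> out = new TreeSet<>(added);
        out.removeAll(removed);
        return out;
    }
}
```

Replaying the slides: merging `add: {A,B} rmv: {A}` with `add: {A,C} rmv: {B,D}` leaves only C visible, and a later `add: {D}` has no visible effect because D is already tombstoned.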
  30. 2P2P-Graph CRDT
      2P-Set for vertices, 2P-Set for edges. Resolution method: remove wins.
      [Diagram: vertices A, C, B; columns us-west-2, event, us-east-1]
      States shown, in slide order:
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
  31. 2P2P-Graph CRDT
      2P-Set for vertices, 2P-Set for edges. Resolution method: remove wins.
      [Diagram: vertices A, C, B; columns us-west-2, event, us-east-1]
      States shown, in slide order:
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {} rmv_v: {A} add_e: {} rmv_e: {}
  32. 2P2P-Graph CRDT
      2P-Set for vertices, 2P-Set for edges. Resolution method: remove wins.
      [Diagram: vertices A, C, B; columns us-west-2, event, us-east-1]
      States shown, in slide order:
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {} rmv_v: {A} add_e: {} rmv_e: {}
        add_v: {A,B,C} rmv_v: {A} add_e: {AB,AC,BC} rmv_e: {AB,AC}
  33. 2P2P-Graph CRDT
      2P-Set for vertices, 2P-Set for edges. Resolution method: remove wins.
      [Diagram: vertices A, D, C, B; columns us-west-2, event, us-east-1]
      States shown, in slide order:
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {} rmv_v: {A} add_e: {} rmv_e: {}
        add_v: {A,B,C} rmv_v: {A} add_e: {AB,AC,BC} rmv_e: {AB,AC}
        add_v: {D} rmv_v: {} add_e: {AD} rmv_e: {}
  34. 2P2P-Graph CRDT
      2P-Set for vertices, 2P-Set for edges. Resolution method: remove wins.
      [Diagram: vertices A, D, C, B; columns us-west-2, event, us-east-1]
      States shown, in slide order:
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {A,B,C} rmv_v: {} add_e: {AB,AC,BC} rmv_e: {}
        add_v: {} rmv_v: {A} add_e: {} rmv_e: {}
        add_v: {A,B,C} rmv_v: {A} add_e: {AB,AC,BC} rmv_e: {AB,AC}
        add_v: {D} rmv_v: {} add_e: {AD} rmv_e: {}
        add_v: {A,B,C,D} rmv_v: {A} add_e: {AB,AC,BC,AD} rmv_e: {AB,AC,AD}
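The 2P2P-Graph walkthrough can be sketched with four grow-only sets. This is one possible reading of the slides, not a reference implementation: `TwoPTwoPGraph` is a hypothetical name, edges are encoded as "from-to" strings (so vertex names are assumed not to contain "-"), and "remove wins" is implemented by tombstoning every known edge that touches a removed vertex, both on removal and after each merge.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical 2P2P-Graph: a 2P-Set of vertices plus a 2P-Set of edges.
final class TwoPTwoPGraph {
    final Set<String> addV = new TreeSet<>();
    final Set<String> rmvV = new TreeSet<>();
    final Set<String> addE = new TreeSet<>();
    final Set<String> rmvE = new TreeSet<>();

    // Assumption: vertex names contain no "-", so the key can be split again.
    static String edge(String from, String to) { return from + "-" + to; }

    void addVertex(String v) { addV.add(v); }

    void addEdge(String from, String to) { addE.add(edge(from, to)); }

    void removeVertex(String v) {
        rmvV.add(v);
        tombstoneDanglingEdges();
    }

    // Remove wins: any edge touching a removed vertex is tombstoned.
    private void tombstoneDanglingEdges() {
        for (String e : addE) {
            String[] ends = e.split("-");
            if (rmvV.contains(ends[0]) || rmvV.contains(ends[1])) rmvE.add(e);
        }
    }

    // Union of all four sets, then re-apply the remove-wins rule.
    TwoPTwoPGraph merge(TwoPTwoPGraph other) {
        TwoPTwoPGraph out = new TwoPTwoPGraph();
        out.addV.addAll(addV);
        out.addV.addAll(other.addV);
        out.rmvV.addAll(rmvV);
        out.rmvV.addAll(other.rmvV);
        out.addE.addAll(addE);
        out.addE.addAll(other.addE);
        out.rmvE.addAll(rmvE);
        out.rmvE.addAll(other.rmvE);
        out.tombstoneDanglingEdges();
        return out;
    }

    boolean hasVertex(String v) { return addV.contains(v) && !rmvV.contains(v); }

    boolean hasEdge(String from, String to) {
        String e = edge(from, to);
        return addE.contains(e) && !rmvE.contains(e);
    }
}
```

With this reading, a concurrent "remove A" in one region and "add D, add edge AD" in the other converge to a graph where D exists but the edge AD does not, matching the final state on the slide.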
  35. Sometimes the state won't converge easily
      ● Missing events (broken links)
        ○ Integrity checks
        ○ Repair
      ● Rerunning bulk events after downtime
        ○ Clocks: Event vs. Ingestion vs. Processor vs. Logical
        ○ Enrichment: IP address reputation changes daily
  36. Background Reconciliator
      ● Reconciliation: Compare hash (Merkle) trees
      ● Compensation: Merge CRDT states
      [Diagram labels: client1 (read); us-east-1b; S3 versioned us-east-1; Background Reconciliator; S3 versioned us-west-2; us-west-2a; client2 (read)]
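Hash-tree reconciliation lets two regions find where they diverge without shipping the full state. The sketch below is a deliberately small two-level tree (a root over a fixed number of buckets) rather than a full Merkle tree. All names (`HashTreeDiff`, `BUCKETS`, `divergedBuckets`) are hypothetical, and the state is modeled as an in-memory map instead of S3 objects.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Two-level hash tree: a small stand-in for a full Merkle tree.
final class HashTreeDiff {
    static final int BUCKETS = 4;

    static String sha256(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is a required JDK algorithm
        }
    }

    // Leaf level: one hash per bucket, over that bucket's sorted key=value pairs.
    static List<String> bucketHashes(Map<String, String> state) {
        List<StringBuilder> buckets = new ArrayList<>();
        for (int i = 0; i < BUCKETS; i++) buckets.add(new StringBuilder());
        for (Map.Entry<String, String> e : new TreeMap<>(state).entrySet()) {
            int b = Math.floorMod(e.getKey().hashCode(), BUCKETS);
            buckets.get(b).append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        List<String> hashes = new ArrayList<>();
        for (StringBuilder sb : buckets) hashes.add(sha256(sb.toString()));
        return hashes;
    }

    // Root level: a single hash over all bucket hashes.
    static String root(List<String> bucketHashes) {
        return sha256(String.join("", bucketHashes));
    }

    // Returns the buckets that differ; empty when the roots already match.
    static List<Integer> divergedBuckets(Map<String, String> a, Map<String, String> b) {
        List<String> ha = bucketHashes(a);
        List<String> hb = bucketHashes(b);
        List<Integer> diff = new ArrayList<>();
        if (root(ha).equals(root(hb))) return diff;
        for (int i = 0; i < BUCKETS; i++) {
            if (!ha.get(i).equals(hb.get(i))) diff.add(i);
        }
        return diff;
    }
}
```

Only the keys in the diverged buckets then need the compensation step, which for CRDT state is a merge of the two replicas' values.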
  37. Takeaways
      ● Define business need for cross region: Availability, Latency, Residency, Analytics
      ● Know your NoSQL: Couchbase != Cassandra != Kafka
      ● Ask about CRDTs: LWW-Register, MV-Register, 2P-Sets, 2P2P-Graphs
      ● Use Reconciliation
      ● Dedicated Fiber and Atomic clocks ARE COMING
  38. “The Internet was designed to be an academic medium. It was not designed to handle this level of transactions.” Fred Matteson @ schwab.com, 1999
  39. Advanced Topics
      ● The real world is more like a slaughterhouse than a pharmacy
      ● Multi Data Center Topologies
        ○ Star (SPOF, simple)
        ○ Ring (TLV ←→ Eilat ←→ Jerusalem ←→ TLV)
        ○ Mesh (resilient, complex)
      ● Data Residency
        ○ Separate PII from data
        ○ Peek at other data centers ad-hoc
