Designing Data-Intensive Applications
Oleg Mürk, Senior Systems Architect
January 2019
YOUR FIRST DATA ARCHITECTURE
• Microservice(s) (Python, node.js)
• Cache (Redis, Memcache)
• Database (MySQL, Postgres)
• Message Queue(s) (RabbitMQ)
• Deployment (Ansible, Docker)
WITH GREAT SUCCESS COME … GREAT CHALLENGES
Requirements:
• Scalability
  • Request Throughput & Latency
  • Data Volume & Throughput & Latency
• Reliability (MTBF)
• Availability (Uptime)
• Maintainability
  • Operability
  • Evolvability
Challenges:
• DB partitioning for scalability
  • More than 10K transactions/second
  • More data than fits on a large DB node
• DB replication for reliability
• DB fail-over for availability
• Data queries (joins) take too long
  • Data joins across multiple DBs
• Process 100K events/second in 100ms
• Historical data approaches a PB / year
  • Need to run large analytical queries
  • Need to reprocess historical events
DATA ARCHITECTURE 5 YEARS LATER
• Reactive Microservice(s)
• Message Queue(s) & Event Topic(s)
• Message / Event Formats
• In-Memory Store
• Operational Store
• Search Store
• Serving Store
• Analytical Store
• Object Store
• Batch Processing
• Stream Processing
• Workflow Processing
• Resource Scheduling & Monitoring & Logging
MESSAGES & EVENTS & FORMATS
• Message Broker
  • Workers consume from a shared queue
  • Each message is processed by one worker
  • Example: RabbitMQ (10K msgs/sec)
• Event Log
  • Partitioned event topics
  • Each consumer maintains its own offsets (see the sketch below)
  • Example: Kafka (1M events/sec)
• Formats & Schema Evolution
  • JSON & XML
  • Protocol Buffers & Thrift & Avro
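Below is a minimal sketch of the event-log model using the kafka-python client: unlike a broker queue, the log is not consumed destructively; each consumer group tracks and commits its own offsets. The topic name, broker address, and group id are illustrative.

```python
from kafka import KafkaConsumer

# Event-log consumption sketch (kafka-python). Topic, broker, and group id are
# illustrative. Records are not deleted when read; each consumer group simply
# advances its own offset per partition.
consumer = KafkaConsumer(
    "orders",                         # partitioned event topic
    bootstrap_servers="broker:9092",
    group_id="billing-service",
    enable_auto_commit=False,         # commit offsets explicitly below
    auto_offset_reset="earliest",     # a new group replays the log from the start
)

for record in consumer:
    print(record.partition, record.offset, record.value)
    consumer.commit()                 # persist this group's position
```

Committing after every record is shown only for clarity; in practice offsets are committed in batches.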
WORKFLOW PROCESSING
• Scheduling tasks with ~1 minute granularity
• Directed Acyclic Graph (DAG)
  • Tasks
  • Dependencies
  • External & time triggers
• Use cases
  • Extract-Transform-Load jobs
  • Aggregations, Reporting
  • Scheduling Batch Processing jobs
• Examples
  • Airflow, Luigi (see the sketch below)
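A minimal Airflow sketch of the DAG model described above: three dependent tasks scheduled daily. The DAG id, commands, and start date are illustrative, and the import path follows the Airflow 1.x layout.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# Workflow-processing sketch: a daily ETL pipeline as a DAG of three tasks.
# DAG id, schedule, and bash commands are illustrative.
dag = DAG(
    dag_id="daily_etl",
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
    catchup=False,
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Dependencies form the directed acyclic graph: extract -> transform -> load.
extract >> transform >> load
```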
BATCH PROCESSING
• Executing distributed computation jobs
• Main abstractions
  • Distributed partitioned dataset (e.g. Spark RDD; see the sketch below)
  • Data dependencies between dataset partitions (Narrow, Wide)
• Use cases
  • Extract-Transform-Load
  • Analytical queries (SQL)
  • Training/updating ML models
• Examples
  • Hadoop Map/Reduce
  • Spark
  • Flink
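A minimal PySpark sketch of the abstractions above: a distributed partitioned dataset (RDD), a narrow dependency (map), and a wide dependency (reduceByKey, which shuffles data across partitions). The input/output paths and record format are illustrative.

```python
from pyspark.sql import SparkSession

# Batch-processing sketch with the RDD abstraction; paths and record format are
# illustrative (comma-separated lines whose first field is a user id).
spark = SparkSession.builder.appName("batch-etl").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///events/2019-01/*.log")   # distributed, partitioned dataset

# Narrow dependency: each output partition depends on a single input partition.
pairs = lines.map(lambda line: (line.split(",")[0], 1))

# Wide dependency: reduceByKey shuffles records between partitions by key.
counts = pairs.reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs:///reports/events-per-user")
spark.stop()
```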
STREAM PROCESSING
• Processing events
  • Less than 100ms-1sec latency
  • More than 10K-100K events/sec
• Equivalent of SQL on event streams
  • Filter, Map, Join, Group, Aggregate (see the sketch below)
• Use cases
  • Extract-Transform-Load
  • Data enrichment
  • Event detection
  • Session analysis
• Examples
  • Spark Streaming, Flink, Kafka Streams & KSQL
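A minimal Spark Structured Streaming sketch of "SQL on an event stream": a windowed group-and-count over events read from Kafka. The broker address, topic name, and window size are illustrative, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Stream-processing sketch: group/aggregate over an unbounded event stream.
# Broker, topic, and window size are illustrative; requires the
# spark-sql-kafka connector package.
spark = SparkSession.builder.appName("stream-agg").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "page_views")
          .load())

# Kafka values arrive as bytes; treat the payload as the page id for this sketch.
views = events.selectExpr("CAST(value AS STRING) AS page", "timestamp")

# Streaming equivalent of GROUP BY: count views per page in 1-minute windows.
counts = views.groupBy(window(col("timestamp"), "1 minute"), col("page")).count()

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```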
IMMUTABLE EVENT LOG & DENORMALIZATION OF STATE
• Event Sourcing (see the sketch below)
• Command Query Responsibility Segregation (CQRS)
• Lambda & Kappa Architecture
• Change Data Capture
• Unbundled Database
[Diagram: a traditional database, where writes go to the leader and a replication stream feeds the followers, contrasted with an unbundled database, where events flow through streaming transforms into transformed events, and streaming joins and aggregations build materialized views.]
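A toy event-sourcing sketch (the event types and fields are invented for illustration): the immutable event log is the source of truth, and a denormalized read model is materialized by folding over it, the same shape as the streaming aggregation into a materialized view shown above.

```python
from collections import defaultdict

# Event-sourcing sketch: state is never updated in place; it is derived by
# replaying an immutable log. Event types and fields are illustrative.
event_log = [
    {"type": "account_opened",  "account": "a1"},
    {"type": "money_deposited", "account": "a1", "amount": 100},
    {"type": "money_withdrawn", "account": "a1", "amount": 30},
]

def materialize_balances(events):
    """Denormalized read model (CQRS query side): balance per account."""
    balances = defaultdict(int)
    for event in events:
        if event["type"] == "money_deposited":
            balances[event["account"]] += event["amount"]
        elif event["type"] == "money_withdrawn":
            balances[event["account"]] -= event["amount"]
    return dict(balances)

print(materialize_balances(event_log))   # {'a1': 70}
```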
TRADITIONAL DATABASE USE CASE
[Diagram: a single Operational Store at the center; a Command Microservice issues writes, while the Query Microservice, Report Job, and Alert Job all issue queries against it; an Event Microservice feeds events through a Message Broker, a Workflow Scheduler triggers the Report Job and Alert Job, and the Alert Job produces notifications/alerts.]
UNBUNDLED DATABASE
[Diagram: the Command Microservice writes to the Operational Store; Change Data Capture and external connectors turn state changes into event streams on a Message Broker; streaming transforms, joins, and aggregations derive further event streams and populate a Serving Store that the Query Microservice reads; an Event Microservice publishes additional events, and notifications/alerts are produced off the streams.]
CHOOSING DATA STORE(S)
• Supported Workloads
  • Cache, Transactional, Search, Serving, Analytical, Objects, etc.
• Data Structures & Indexes
  • Read- vs Write-Optimized
  • Query types
• Replication
• Partitioning
• Transactions
• Consistency vs Availability
DATA STORE ZOO
• In-Memory Cache / Data Grid
  • Memcache, Redis, Ignite, Hazelcast
• SQL
  • OLTP: MySQL, Postgres
  • OLAP: Redshift, Vertica, Hive
• NoSQL = "Not yet SQL"
  • Key-Value Stores: Redis, Riak
  • Wide Column Stores: Cassandra, HBase, RocksDB
  • Document Stores: MongoDB, Elastic
  • Specialized: Time-Series, Graph, RDF, Object-Oriented
• Object Stores
  • S3, HDFS
DATA STRUCTURES (FOR RANDOM READS/WRITES)
• B+ Trees
  • Optimized for random reads
  • Balanced tree with ~4KB block size
  • Random writes are less efficient
  • SQL OLTP DBs (MySQL, Postgres)
• Log-Structured Merge Trees (see the sketch below)
  • Optimized for random writes
  • Hierarchical compaction scheme
  • Random reads are less efficient
  • NoSQL stores (Cassandra, HBase)
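A toy log-structured merge sketch (purely illustrative: no write-ahead log, bloom filters, or compaction): writes go to an in-memory memtable that is flushed as immutable sorted runs, so random writes are cheap while a read may have to check several runs.

```python
import bisect

# Toy LSM-tree: memtable for writes, immutable sorted runs ("SSTables") on flush.
# Real stores (Cassandra, HBase, RocksDB) add WALs, bloom filters, and compaction.
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                         # sorted [(key, value), ...] runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Random writes are cheap: an in-memory insert, occasionally a sequential flush.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Random reads may have to consult several runs, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"user:{i}", {"id": i})
print(db.get("user:7"))                        # {'id': 7}
```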
COLUMNAR STORAGE (FOR ANALYTICS)
• Columnar representation (see the sketch below)
  • Each column is compressed separately
  • Each column chunk has metadata (e.g. min/max values)
  • Can read only the column chunks that match a filter
• Formats: Parquet, ORC
• SQL OLAP DBs: Redshift, Vertica, Hive
[Figure: the same table (Last Name, First Name, E-mail, Phone #, Street Address) laid out as row storage vs columnar storage.]
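A minimal sketch of columnar storage in practice, writing and reading a Parquet file with pandas (a pyarrow or fastparquet engine is assumed; the file name and columns are illustrative): reading back a single column touches only that column's chunks.

```python
import pandas as pd

# Columnar-storage sketch (requires pyarrow or fastparquet).
# File name and columns are illustrative.
df = pd.DataFrame({
    "last_name":  ["Smith", "Jones"],
    "first_name": ["Ann", "Bob"],
    "email":      ["ann@example.com", "bob@example.com"],
})

df.to_parquet("contacts.parquet")   # each column is encoded and compressed separately

# Column pruning: only the requested column chunks are read from disk.
emails = pd.read_parquet("contacts.parquet", columns=["email"])
print(emails)
```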
AND NOW, LADIES AND GENTLEMEN…
• Last 50 years of Database Research in 3 slides!
• Last 40 years of Distributed Systems Research in 1 slide!
[Photos: Michael Stonebraker and Leslie Lamport]
REPLICATION
• Single-Leader Replication
  • Synchronous vs Asynchronous
  • Failover & Fencing (epoch numbers, STONITH)
• Multi-Leader Replication
  • Write conflict resolution (OT in Google Docs)
• Leaderless Replication
  • Quorums for reading and writing: w + r > n (Cassandra; see the sketch below)
  • CRDTs: Convergent Replicated Data Types (Akka Cluster, Riak)
• Consistency of Reads
  • Read-after-Write Consistency (S3, sometimes)
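A toy sketch of leaderless quorum reads and writes (Dynamo/Cassandra style; node count and versioning are simplified): with n replicas, writing to w and reading from r nodes such that w + r > n, every read quorum overlaps the latest write quorum, so the reader can pick the newest version.

```python
import random

# Toy quorum replication: n replicas, write to w, read from r, with w + r > n.
N, W, R = 3, 2, 2
assert W + R > N, "read and write quorums must overlap"

replicas = [dict() for _ in range(N)]   # per-replica map: key -> (version, value)
version = 0

def write(key, value):
    global version
    version += 1
    # The write succeeds once w replicas acknowledge (here: any w of the n).
    for node in random.sample(range(N), W):
        replicas[node][key] = (version, value)

def read(key):
    # Read from r replicas and return the value with the highest version seen.
    answers = [replicas[node].get(key) for node in random.sample(range(N), R)]
    answers = [a for a in answers if a is not None]
    return max(answers)[1] if answers else None

write("user:1", "v1")
write("user:1", "v2")
print(read("user:1"))   # always "v2": the read quorum overlaps the latest write quorum
```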
PARTITIONING
• Partitioning of Key-Value Data (see the sketch below)
  • By hash of key (Cassandra, Elastic)
  • By key range (HBase)
• Skewed Workloads & Hot Spots
  • Salting (HBase, S3)
• Partitioning & Indexes
  • By document (Elastic, Cassandra)
  • By term (Cassandra Materialized Views)
• Re-partitioning / Over-partitioning
• Partitions & Replication
  • Consistent Hashing (Elastic, Cassandra)
  • Partition Leaders & Followers (Kafka)
  • Region Servers (HBase)
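A minimal sketch of partitioning key-value data by hash of key, plus salting a hot key to spread a skewed workload; the partition count, key format, and salt-bucket count are illustrative.

```python
import hashlib
import random

# Partitioning sketch: route each key to one of a fixed number of partitions.
NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    # A stable hash (MD5 here) keeps the key-to-partition mapping consistent
    # across processes and restarts, unlike Python's salted built-in hash().
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def salted_partition(hot_key: str, salt_buckets: int = 4) -> int:
    # Salting a hot key spreads its writes over several partitions; readers
    # must then query all salt buckets and merge the results.
    return partition_for(f"{random.randrange(salt_buckets)}:{hot_key}")

print(partition_for("user:42"), salted_partition("celebrity:1"))
```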
TRANSACTIONS
• Consistency
  • ACID: Atomicity, Consistency, Isolation, Durability
  • BASE: Basically Available, Soft state, Eventual consistency
• Single- vs Multi-Object Transactions
  • Check-and-Put (HBase, Cassandra LWT)
• Isolation Levels
  • Read Committed (RW locks, MVCC)
  • Repeatable Read (Snapshot Isolation)
  • Serializable (2PL, Serializable Snapshot Isolation; see the sketch below)
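A hedged sketch of the Serializable isolation level with psycopg2 against Postgres: two writes run in one multi-object transaction, and the transaction is retried if the database aborts it with a serialization failure. The connection string and table/column names are illustrative.

```python
import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE, TransactionRollbackError

# Multi-object transaction sketch at SERIALIZABLE isolation, with retry.
# Connection string and table/column names are illustrative.
conn = psycopg2.connect("dbname=app")
conn.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)

def transfer(src, dst, amount, retries=3):
    for _ in range(retries):
        try:
            with conn:                           # commit on success, roll back on error
                with conn.cursor() as cur:
                    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                (amount, src))
                    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                (amount, dst))
            return True
        except TransactionRollbackError:
            continue                             # a concurrent transaction won; retry
    return False
```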
CONSISTENCY VS AVAILABILITY
• Distributed Consensus
  • FLP: impossible to guarantee in asynchronous networks (with even one faulty process)
  • ZooKeeper, etcd, Bitcoin: … can do in practice!
  • Atomic Broadcast aka Distributed Log
• On Network Partition
  • CAP: pick either Consistency or Availability
  • AP: Elastic, Cassandra, Akka Cluster
  • CP: Kafka, HBase, Cassandra LWT
  • HAT: think through consistency/latency requirements case-by-case
• NoSQL = "Not Only SQL"
  • Google Spanner, CockroachDB, FaunaDB, FoundationDB
DESIGNING DATA-INTENSIVE APPLICATIONS
BUILDING TRUST FOR THE CONNECTED WORLD
