SlideShare a Scribd company logo
Design Patterns for Distributed
  Non-Relational Databases
                  aka
 Just Enough Distributed Systems To Be
               Dangerous
            (in 40 minutes)


             Todd Lipcon
              (@tlipcon)
                 Cloudera


            June 11, 2009
Introduction
Common Underlying Assumptions
Design Patterns
   Consistent Hashing
   Consistency Models
   Data Models
   Storage Layouts
   Log-Structured Merge Trees
Cluster Management
   Omniscient Master
   Gossip
Questions to Ask Presenters
Why We’re All Here


     Scaling up doesn’t work
     Scaling out with traditional RDBMSs isn’t so
     hot either
         Sharding scales, but you lose all the features that
         make RDBMSs useful!
         Sharding is operationally obnoxious.
     If we don’t need relational features, we want a
     distributed NRDBMS.
Closed-source NRDBMSs
“The Inspiration”




           Google BigTable
                    Applications: webtable, Reader, Maps, Blogger,
                    etc.
           Amazon Dynamo
                    Shopping Cart, ?
           Yahoo! PNUTS
                    Applications: ?
Data Interfaces
“This is the NOSQL meetup, right?”




          Every row has a key (PK)
          Key/value get/put
          multiget/multiput
          Range scan? With predicate pushdown?
          MapReduce?
          SQL?
Underlying Assumptions
Assumptions - Data Size


      The data does not fit on one node.
      The data may not fit on one rack.
      SANs are too expensive.

  Conclusion:
  The system must partition its data across many
  nodes.
Assumptions - Reliability
      The system must be highly available to serve
      web (and other) applications.
      Since the system runs on many nodes, nodes
      will crash during normal operation.
      Data must be safe even though disks and
      nodes will fail.

  Conclusion:
  The system must replicate each row to multiple
  nodes and remain available despite certain node and
  disk failure.
Assumptions - Performance
...and price thereof


            All systems we’re talking about today are
            meant for real-time use.
            95th or 99th percentile is more important than
            average latency
            Commodity hardware and slow disks.

     Conclusion:
     The system needs to perform well on commodity
     hardware, and maintain low latency even during
     recovery operations.
Design Patterns
Partitioning Schemes
“Where does a key live?”




          Given a key, we need to determine which
          node(s) it belongs on.
          If that node is down, we need to find another
          copy elsewhere.
          Difficulties:
                 Unbounded number of keys.
                 Dynamic cluster membership.
                 Node failures.
Consistent Hashing
Maintaining hashing in a dynamic cluster
Consistent Hashing
Key Placement
Consistency Models

     A consistency model determines rules for
     visibility and apparent order of updates.
     Example:
         Row X is replicated on nodes M and N
         Client A writes row X to node N
         Some period of time t elapses.
         Client B reads row X from node M
         Does client B see the write from client A?
     Consistency is a continuum with tradeoffs
Strict Consistency

     All read operations must return the data from
     the latest completed write operation, regardless
     of which replica the operations went to
     Implies either:
         All operations for a given row go to the same node
         (replication for availability)
         or nodes employ some kind of distributed
         transaction protocol (eg 2 Phase Commit or Paxos)
     CAP Theorem: Strict Consistency can’t be
     achieved at the same time as availability and
     partition-tolerance.
Eventual Consistency

     As t → ∞, readers will see writes.
     In a steady state, the system is guaranteed to
     eventually return the last written value
     For example: DNS, or MySQL Slave
     Replication (log shipping)
     Special cases of eventual consistency:
         Read-your-own-writes consistency (“sent mail”
         box)
         Causal consistency (if you write Y after reading X,
         anyone who reads Y sees X)
         gmail has RYOW but not causal!
Timestamps and Vector Clocks
Determining a history of a row



           Eventual consistency relies on deciding what
           value a row will eventually converge to
           In the case of two writers writing at “the
           same” time, this is difficult
           Timestamps are one solution, but rely on
           synchronized clocks and don’t capture causality
           Vector clocks are an alternative method of
           capturing order in a distributed system
Vector Clocks

     Definition:
         A vector clock is a tuple {t1 , t2 , ..., tn } of clock
         values from each node
         v1 < v2 if:
               For all i, v1i ≤ v2i
               For at least one i, v1i < v2i
         v1 < v2 implies global time ordering of events
     When data is written from node i, it sets ti to
     its clock value.
     This allows eventual consistency to resolve
     consistency between writes on multiple replicas.
Data Models
What’s in a row?




          Primary Key → Value
          Value could be:
                   Blob
                   Structured (set of columns)
                   Semi-structured (set of column families with
                   arbitrary columns, eg linkto:<url> in webtable)
                   Each has advantages and disadvantages
          Secondary Indexes? Tables/namespaces?
Multi-Version Storage
Using Timestamps for a 3rd dimension



          Each table cell has a timestamp
          Timestamps don’t necessarily need to
          correspond to real life
          Multiple versions (and tombstones) can exist
          concurrently for a given row
          Reads may return “most recent”, “most recent
          before T”, etc. (free snapshots)
          System may provide optimistic concurrency
          control with compare-and-swap on timestamps
Storage Layouts
How do we lay out rows and columns on disk?




          Determines performance of different access
          patterns
          Storage layout maps directly to disk access
          patterns
          Fast writes? Fast reads? Fast scans?
          Whole-row access or subsets of columns?
Row-based Storage




     Pros:
         Good locality of access (on disk and in cache) of
         different columns
         Read/write of a single row is a single IO operation.
     Cons:
         But if you want to scan only one column, you still
         read all.
Columnar Storage




     Pros:
         Data for a given column is stored sequentially
         Scanning a single column (eg aggregate queries) is
         fast
     Cons:
         Reading a single row may seek once per column.
Columnar Storage with Locality Groups




     Columns are organized into families (“locality
     groups”)
     Benefits of row-based layout within a group.
     Benefits of column-based - don’t have to read
     groups you don’t care about.
Log Structured Merge Trees
aka “The BigTable model”

           Random IO for writes is bad (and impossible in
           some DFSs)
           LSM Trees convert random writes to sequential
           writes
           Writes go to a commit log and in-memory
           storage (Memtable)
           The Memtable is occasionally flushed to disk
           (SSTable)
           The disk stores are periodically compacted
    P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil. The log-structured merge-tree

    (LSM-tree). Acta Informatica. 1996.
LSM Data Layout
LSM Write Path
LSM Read Path
LSM Read Path + Bloom Filters
LSM Memtable Flush
LSM Compaction
Cluster Management

     Clients need to know where to find data
     (consistent hashing tokens, etc)
     Internal nodes may need to find each other as
     well
     Since nodes may fail and recover, a
     configuration file doesn’t really suffice
     We need a way of keeping some kind of
     consistent view of the cluster state
Omniscient Master

     When nodes join/leave or change state, they
     talk to a master
     That master holds the authoritative view of the
     world
     Pros: simplicity, single consistent view of the
     cluster
     Cons: potential SPOF unless master is made
     highly available. Not partition-tolerant.
Gossip
     Gossip is one method to propagate a view of
     cluster status.
     Every t seconds, on each node:
         The node selects some other node to chat with.
         The node reconciles its view of the cluster with its
         gossip buddy.
         Each node maintains a “timestamp” for itself and
         for the most recent information it has from every
         other node.
     Information about cluster state spreads in
     O(lgn) rounds (eventual consistency)
     Scalable and no SPOF, but state is only
     eventually consistent
Gossip - Initial State
Gossip - Round 1
Gossip - Round 2
Gossip - Round 3
Gossip - Round 4
Questions to Ask Presenters
Scalability and Reliability

      What are the scaling bottlenecks? How does it
      react when overloaded?
      Are there any single points of failure?
      When nodes fail, does the system maintain
      availability of all data?
      Does the system automatically re-replicate
      when replicas are lost?
      When new nodes are added, does the system
      automatically rebalance data?
Performance

     What’s the goal? Batch throughput or request
     latency?
     How many seeks for reads? For writes? How
     many net RTTs?
     What 99th percentile latencies have been
     measured in practice?
     How do failures impact serving latencies?
     What throughput has been measured in
     practice for bulk loads?
Consistency

     What consistency model does the system
     provide?
     What situations would cause a lapse of
     consistency, if any?
     Can consistency semantics be tweaked by
     configuration settings?
     Is there a way to do compare-and-swap on row
     contents for optimistic locking? Multirow?
Cluster Management and Topology

     Does the system have a single master? Does it
     use gossip to spread cluster management data?
     Can it withstand network partitions and still
     provide some level of service?
     Can it be deployed across multiple datacenters
     for disaster recovery?
     Can nodes be commissioned/decomissioned
     automatically without downtime?
     Operational hooks for monitoring and metrics?
Data Model and Storage

     What data model and storage system does the
     system provide?
     Is it pluggable?
     What IO patterns does the system cause under
     different workloads?
     Is the system best at random or sequential
     access? For read-mostly or write-mostly?
     Are there practical limits on key, value, or row
     sizes?
     Is compression available?
Data Access Methods


     What methods exist for accessing data? Can I
     access it from language X?
     Is there a way to perform filtering or selection
     at the server side?
     Are there bulk load tools to get data in/out
     efficiently?
     Is there a provision for data backup/restore?
Real Life Considerations
(I was talking about fake life in the first 45 slides)

            Who uses this system? How big are the
            clusters it’s deployed on, and what kind of load
            do they handle?
            Who develops this system? Is this a community
            project or run by a single organization? Are
            outside contributions regularly accepted?
            Who supports this system? Is there an active
            community who will help me deploy it and
            debug issues? Docs?
            What is the open source license?
            What is the development roadmap?
Questions?
http://cloudera-todd.s3.amazonaws.com/nosql.pdf

More Related Content

What's hot

[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
NAVER D2
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
Ceph Community
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
MIJIN AN
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Flink Forward
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
Databricks
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
Erik Krogen
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
DataStax Academy
 
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
DataStax
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
KafkaZone
 
Persistent Storage with Containers with Kubernetes & OpenShift
Persistent Storage with Containers with Kubernetes & OpenShiftPersistent Storage with Containers with Kubernetes & OpenShift
Persistent Storage with Containers with Kubernetes & OpenShift
Red Hat Events
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
Saurav Haloi
 
Redis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech TalkRedis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech Talk
Red Hat Developers
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
DataWorks Summit
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
AIMDek Technologies
 
Container Orchestration using Kubernetes
Container Orchestration using KubernetesContainer Orchestration using Kubernetes
Container Orchestration using Kubernetes
Hesham Amin
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
DataWorks Summit
 

What's hot (20)

[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
 
Persistent Storage with Containers with Kubernetes & OpenShift
Persistent Storage with Containers with Kubernetes & OpenShiftPersistent Storage with Containers with Kubernetes & OpenShift
Persistent Storage with Containers with Kubernetes & OpenShift
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Redis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech TalkRedis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech Talk
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Container Orchestration using Kubernetes
Container Orchestration using KubernetesContainer Orchestration using Kubernetes
Container Orchestration using Kubernetes
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 

Viewers also liked

Scalable Databases - From Relational Databases To Polyglot Persistence
Scalable Databases - From Relational Databases To Polyglot PersistenceScalable Databases - From Relational Databases To Polyglot Persistence
Scalable Databases - From Relational Databases To Polyglot Persistence
Sergio Bossa
 
Why Projects Fail: Obstacles and Solutions
Why Projects Fail: Obstacles and SolutionsWhy Projects Fail: Obstacles and Solutions
Why Projects Fail: Obstacles and Solutions
Michael Krigsman
 
Scaling python webapps from 0 to 50 million users - A top-down approach
Scaling python webapps from 0 to 50 million users - A top-down approachScaling python webapps from 0 to 50 million users - A top-down approach
Scaling python webapps from 0 to 50 million users - A top-down approach
Jinal Jhaveri
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
Jonas Bonér
 
Server load balancer ppt
Server load balancer pptServer load balancer ppt
Server load balancer ppt
Shilpi Tandon
 
Developing web apps using Erlang-Web
Developing web apps using Erlang-WebDeveloping web apps using Erlang-Web
Developing web apps using Erlang-Web
fanqstefan
 
Node.js and NoSQL
Node.js and NoSQLNode.js and NoSQL
Node.js and NoSQL
Nicholas McClay
 
Elasticsearch - Zero to Hero
Elasticsearch - Zero to HeroElasticsearch - Zero to Hero
Elasticsearch - Zero to Hero
Daniel Ziv
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
Ram kumar
 
Modelando aplicação em documento - MongoDB
Modelando aplicação em documento - MongoDBModelando aplicação em documento - MongoDB
Modelando aplicação em documento - MongoDB
Thiago Avelino
 
GOOGLE: Designs, Lessons and Advice from Building Large Distributed Systems
GOOGLE: Designs, Lessons and Advice from Building Large   Distributed Systems GOOGLE: Designs, Lessons and Advice from Building Large   Distributed Systems
GOOGLE: Designs, Lessons and Advice from Building Large Distributed Systems
xlight
 
Messaging With Apache ActiveMQ
Messaging With Apache ActiveMQMessaging With Apache ActiveMQ
Messaging With Apache ActiveMQ
Bruce Snyder
 
ActiveMQ In Action
ActiveMQ In ActionActiveMQ In Action
ActiveMQ In Action
Bruce Snyder
 
Apache ActiveMQ - Enterprise messaging in action
Apache ActiveMQ - Enterprise messaging in actionApache ActiveMQ - Enterprise messaging in action
Apache ActiveMQ - Enterprise messaging in action
dejanb
 
Relational vs. Non-Relational
Relational vs. Non-RelationalRelational vs. Non-Relational
Relational vs. Non-Relational
PostgreSQL Experts, Inc.
 
VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012
Eonblast
 
LOAD BALANCING ALGORITHMS
LOAD BALANCING ALGORITHMSLOAD BALANCING ALGORITHMS
LOAD BALANCING ALGORITHMS
tanmayshah95
 
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Spark Summit
 
Wakanda: NoSQL for Model-Driven Web applications - NoSQL matters 2012
Wakanda: NoSQL for Model-Driven Web applications - NoSQL matters 2012Wakanda: NoSQL for Model-Driven Web applications - NoSQL matters 2012
Wakanda: NoSQL for Model-Driven Web applications - NoSQL matters 2012
Alexandre Morgaut
 
Yoshis caanoo emulator_fact_sheets_v03
Yoshis caanoo emulator_fact_sheets_v03Yoshis caanoo emulator_fact_sheets_v03
Yoshis caanoo emulator_fact_sheets_v03
Rubén de Celis Hernández
 

Viewers also liked (20)

Scalable Databases - From Relational Databases To Polyglot Persistence
Scalable Databases - From Relational Databases To Polyglot PersistenceScalable Databases - From Relational Databases To Polyglot Persistence
Scalable Databases - From Relational Databases To Polyglot Persistence
 
Why Projects Fail: Obstacles and Solutions
Why Projects Fail: Obstacles and SolutionsWhy Projects Fail: Obstacles and Solutions
Why Projects Fail: Obstacles and Solutions
 
Scaling python webapps from 0 to 50 million users - A top-down approach
Scaling python webapps from 0 to 50 million users - A top-down approachScaling python webapps from 0 to 50 million users - A top-down approach
Scaling python webapps from 0 to 50 million users - A top-down approach
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Server load balancer ppt
Server load balancer pptServer load balancer ppt
Server load balancer ppt
 
Developing web apps using Erlang-Web
Developing web apps using Erlang-WebDeveloping web apps using Erlang-Web
Developing web apps using Erlang-Web
 
Node.js and NoSQL
Node.js and NoSQLNode.js and NoSQL
Node.js and NoSQL
 
Elasticsearch - Zero to Hero
Elasticsearch - Zero to HeroElasticsearch - Zero to Hero
Elasticsearch - Zero to Hero
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
Modelando aplicação em documento - MongoDB
Modelando aplicação em documento - MongoDBModelando aplicação em documento - MongoDB
Modelando aplicação em documento - MongoDB
 
GOOGLE: Designs, Lessons and Advice from Building Large Distributed Systems
GOOGLE: Designs, Lessons and Advice from Building Large   Distributed Systems GOOGLE: Designs, Lessons and Advice from Building Large   Distributed Systems
GOOGLE: Designs, Lessons and Advice from Building Large Distributed Systems
 
Messaging With Apache ActiveMQ
Messaging With Apache ActiveMQMessaging With Apache ActiveMQ
Messaging With Apache ActiveMQ
 
ActiveMQ In Action
ActiveMQ In ActionActiveMQ In Action
ActiveMQ In Action
 
Apache ActiveMQ - Enterprise messaging in action
Apache ActiveMQ - Enterprise messaging in actionApache ActiveMQ - Enterprise messaging in action
Apache ActiveMQ - Enterprise messaging in action
 
Relational vs. Non-Relational
Relational vs. Non-RelationalRelational vs. Non-Relational
Relational vs. Non-Relational
 
VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012VoltDB and Erlang - Tech planet 2012
VoltDB and Erlang - Tech planet 2012
 
LOAD BALANCING ALGORITHMS
LOAD BALANCING ALGORITHMSLOAD BALANCING ALGORITHMS
LOAD BALANCING ALGORITHMS
 
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
 
Wakanda: NoSQL for Model-Driven Web applications - NoSQL matters 2012
Wakanda: NoSQL for Model-Driven Web applications - NoSQL matters 2012Wakanda: NoSQL for Model-Driven Web applications - NoSQL matters 2012
Wakanda: NoSQL for Model-Driven Web applications - NoSQL matters 2012
 
Yoshis caanoo emulator_fact_sheets_v03
Yoshis caanoo emulator_fact_sheets_v03Yoshis caanoo emulator_fact_sheets_v03
Yoshis caanoo emulator_fact_sheets_v03
 

Similar to Design Patterns for Distributed Non-Relational Databases

Design Patterns For Distributed NO-reational databases
Design Patterns For Distributed NO-reational databasesDesign Patterns For Distributed NO-reational databases
Design Patterns For Distributed NO-reational databases
lovingprince58
 
Basics of Distributed Systems - Distributed Storage
Basics of Distributed Systems - Distributed StorageBasics of Distributed Systems - Distributed Storage
Basics of Distributed Systems - Distributed Storage
Nilesh Salpe
 
Cassandra
CassandraCassandra
Cassandra
Upaang Saxena
 
Lecture-04-Principles of data management.pdf
Lecture-04-Principles of data management.pdfLecture-04-Principles of data management.pdf
Lecture-04-Principles of data management.pdf
manimozhi98
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
Dan Gunter
 
NoSql Database
NoSql DatabaseNoSql Database
NoSql Database
Suresh Parmar
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
Nathan Milford
 
Cassandra & Python - Springfield MO User Group
Cassandra & Python - Springfield MO User GroupCassandra & Python - Springfield MO User Group
Cassandra & Python - Springfield MO User Group
Adam Hutson
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
Renato Lucindo
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
Ding Li
 
Cassandra
CassandraCassandra
Cassandra
julesbravo
 
Data Engineering for Data Scientists
Data Engineering for Data Scientists Data Engineering for Data Scientists
Data Engineering for Data Scientists
jlacefie
 
MySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspectiveMySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspective
Ulf Wendel
 
Spinnaker VLDB 2011
Spinnaker VLDB 2011Spinnaker VLDB 2011
Spinnaker VLDB 2011
sandeep_tata
 
Distributed systems and scalability rules
Distributed systems and scalability rulesDistributed systems and scalability rules
Distributed systems and scalability rules
Oleg Tsal-Tsalko
 
Distributed Algorithms
Distributed AlgorithmsDistributed Algorithms
Distributed Algorithms
913245857
 
Everything you always wanted to know about Distributed databases, at devoxx l...
Everything you always wanted to know about Distributed databases, at devoxx l...Everything you always wanted to know about Distributed databases, at devoxx l...
Everything you always wanted to know about Distributed databases, at devoxx l...
javier ramirez
 
Handling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web SystemsHandling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web Systems
Vineet Gupta
 
No sql (not only sql)
No sql                 (not only sql)No sql                 (not only sql)
No sql (not only sql)
Priyodarshini Dhar
 
Highly available distributed databases, how they work, javier ramirez at teowaki
Highly available distributed databases, how they work, javier ramirez at teowakiHighly available distributed databases, how they work, javier ramirez at teowaki
Highly available distributed databases, how they work, javier ramirez at teowaki
javier ramirez
 

Similar to Design Patterns for Distributed Non-Relational Databases (20)

Design Patterns For Distributed NO-reational databases
Design Patterns For Distributed NO-reational databasesDesign Patterns For Distributed NO-reational databases
Design Patterns For Distributed NO-reational databases
 
Basics of Distributed Systems - Distributed Storage
Basics of Distributed Systems - Distributed StorageBasics of Distributed Systems - Distributed Storage
Basics of Distributed Systems - Distributed Storage
 
Cassandra
CassandraCassandra
Cassandra
 
Lecture-04-Principles of data management.pdf
Lecture-04-Principles of data management.pdfLecture-04-Principles of data management.pdf
Lecture-04-Principles of data management.pdf
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
NoSql Database
NoSql DatabaseNoSql Database
NoSql Database
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Cassandra & Python - Springfield MO User Group
Cassandra & Python - Springfield MO User GroupCassandra & Python - Springfield MO User Group
Cassandra & Python - Springfield MO User Group
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
Cassandra
CassandraCassandra
Cassandra
 
Data Engineering for Data Scientists
Data Engineering for Data Scientists Data Engineering for Data Scientists
Data Engineering for Data Scientists
 
MySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspectiveMySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspective
 
Spinnaker VLDB 2011
Spinnaker VLDB 2011Spinnaker VLDB 2011
Spinnaker VLDB 2011
 
Distributed systems and scalability rules
Distributed systems and scalability rulesDistributed systems and scalability rules
Distributed systems and scalability rules
 
Distributed Algorithms
Distributed AlgorithmsDistributed Algorithms
Distributed Algorithms
 
Everything you always wanted to know about Distributed databases, at devoxx l...
Everything you always wanted to know about Distributed databases, at devoxx l...Everything you always wanted to know about Distributed databases, at devoxx l...
Everything you always wanted to know about Distributed databases, at devoxx l...
 
Handling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web SystemsHandling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web Systems
 
No sql (not only sql)
No sql                 (not only sql)No sql                 (not only sql)
No sql (not only sql)
 
Highly available distributed databases, how they work, javier ramirez at teowaki
Highly available distributed databases, how they work, javier ramirez at teowakiHighly available distributed databases, how they work, javier ramirez at teowaki
Highly available distributed databases, how they work, javier ramirez at teowaki
 

Recently uploaded

UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
FIDO Alliance
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Zilliz
 
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision MakingConnector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
DianaGray10
 
Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10
ankush9927
 
Tailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer InsightsTailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer Insights
SynapseIndia
 
Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
Steven Carlson
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
Zilliz
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
bhumivarma35300
 
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
alexjohnson7307
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
maigasapphire
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
ZachWylie3
 
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
shanihomely
 
Using LLM Agents with Llama 3, LangGraph and Milvus
Using LLM Agents with Llama 3, LangGraph and MilvusUsing LLM Agents with Llama 3, LangGraph and Milvus
Using LLM Agents with Llama 3, LangGraph and Milvus
Zilliz
 
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and DisadvantagesBLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
SAI KAILASH R
 
Sonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdfSonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdf
SubhamMandal40
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
David Wilson
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
Google Developer Group - Harare
 
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
alexjohnson7307
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
Jimmy Lai
 

Recently uploaded (20)

UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
 
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision MakingConnector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
 
Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10
 
Tailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer InsightsTailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer Insights
 
Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
 
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
 
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
 
Using LLM Agents with Llama 3, LangGraph and Milvus
Using LLM Agents with Llama 3, LangGraph and MilvusUsing LLM Agents with Llama 3, LangGraph and Milvus
Using LLM Agents with Llama 3, LangGraph and Milvus
 
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and DisadvantagesBLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
 
Sonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdfSonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdf
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
 
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 

Design Patterns for Distributed Non-Relational Databases

  • 1. Design Patterns for Distributed Non-Relational Databases aka Just Enough Distributed Systems To Be Dangerous (in 40 minutes) Todd Lipcon (@tlipcon) Cloudera June 11, 2009
  • 2. Introduction Common Underlying Assumptions Design Patterns Consistent Hashing Consistency Models Data Models Storage Layouts Log-Structured Merge Trees Cluster Management Omniscient Master Gossip Questions to Ask Presenters
  • 3. Why We’re All Here Scaling up doesn’t work Scaling out with traditional RDBMSs isn’t so hot either Sharding scales, but you lose all the features that make RDBMSs useful! Sharding is operationally obnoxious. If we don’t need relational features, we want a distributed NRDBMS.
  • 4. Closed-source NRDBMSs “The Inspiration” Google BigTable Applications: webtable, Reader, Maps, Blogger, etc. Amazon Dynamo Shopping Cart, ? Yahoo! PNUTS Applications: ?
  • 5. Data Interfaces “This is the NOSQL meetup, right?” Every row has a key (PK) Key/value get/put multiget/multiput Range scan? With predicate pushdown? MapReduce? SQL?
  • 7. Assumptions - Data Size The data does not fit on one node. The data may not fit on one rack. SANs are too expensive. Conclusion: The system must partition its data across many nodes.
  • 8. Assumptions - Reliability The system must be highly available to serve web (and other) applications. Since the system runs on many nodes, nodes will crash during normal operation. Data must be safe even though disks and nodes will fail. Conclusion: The system must replicate each row to multiple nodes and remain available despite certain node and disk failure.
  • 9. Assumptions - Performance ...and price thereof All systems we’re talking about today are meant for real-time use. 95th or 99th percentile is more important than average latency Commodity hardware and slow disks. Conclusion: The system needs to perform well on commodity hardware, and maintain low latency even during recovery operations.
  • 11. Partitioning Schemes “Where does a key live?” Given a key, we need to determine which node(s) it belongs on. If that node is down, we need to find another copy elsewhere. Difficulties: Unbounded number of keys. Dynamic cluster membership. Node failures.
  • 14. Consistency Models A consistency model determines rules for visibility and apparent order of updates. Example: Row X is replicated on nodes M and N Client A writes row X to node N Some period of time t elapses. Client B reads row X from node M Does client B see the write from client A? Consistency is a continuum with tradeoffs
  • 15. Strict Consistency All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to Implies either: All operations for a given row go to the same node (replication for availability) or nodes employ some kind of distributed transaction protocol (eg 2 Phase Commit or Paxos) CAP Theorem: Strict Consistency can’t be achieved at the same time as availability and partition-tolerance.
  • 16. Eventual Consistency As t → ∞, readers will see writes. In a steady state, the system is guaranteed to eventually return the last written value For example: DNS, or MySQL Slave Replication (log shipping) Special cases of eventual consistency: Read-your-own-writes consistency (“sent mail” box) Causal consistency (if you write Y after reading X, anyone who reads Y sees X) gmail has RYOW but not causal!
  • 17. Timestamps and Vector Clocks Determining a history of a row Eventual consistency relies on deciding what value a row will eventually converge to In the case of two writers writing at “the same” time, this is difficult Timestamps are one solution, but rely on synchronized clocks and don’t capture causality Vector clocks are an alternative method of capturing order in a distributed system
  • 18. Vector Clocks Definition: A vector clock is a tuple {t1 , t2 , ..., tn } of clock values from each node v1 < v2 if: For all i, v1i ≤ v2i For at least one i, v1i < v2i v1 < v2 implies global time ordering of events When data is written from node i, it sets ti to its clock value. This allows eventual consistency to resolve consistency between writes on multiple replicas.
  • 19. Data Models What’s in a row? Primary Key → Value Value could be: Blob Structured (set of columns) Semi-structured (set of column families with arbitrary columns, eg linkto:<url> in webtable) Each has advantages and disadvantages Secondary Indexes? Tables/namespaces?
  • 20. Multi-Version Storage Using Timestamps for a 3rd dimension Each table cell has a timestamp Timestamps don’t necessarily need to correspond to real life Multiple versions (and tombstones) can exist concurrently for a given row Reads may return “most recent”, “most recent before T”, etc. (free snapshots) System may provide optimistic concurrency control with compare-and-swap on timestamps
  • 21. Storage Layouts How do we lay out rows and columns on disk? Determines performance of different access patterns Storage layout maps directly to disk access patterns Fast writes? Fast reads? Fast scans? Whole-row access or subsets of columns?
  • 22. Row-based Storage Pros: Good locality of access (on disk and in cache) of different columns Read/write of a single row is a single IO operation. Cons: But if you want to scan only one column, you still read all.
  • 23. Columnar Storage Pros: Data for a given column is stored sequentially Scanning a single column (eg aggregate queries) is fast Cons: Reading a single row may seek once per column.
  • 24. Columnar Storage with Locality Groups Columns are organized into families (“locality groups”) Benefits of row-based layout within a group. Benefits of column-based - don’t have to read groups you don’t care about.
  • 25. Log Structured Merge Trees aka “The BigTable model” Random IO for writes is bad (and impossible in some DFSs) LSM Trees convert random writes to sequential writes Writes go to a commit log and in-memory storage (Memtable) The Memtable is occasionally flushed to disk (SSTable) The disk stores are periodically compacted P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil. The log-structured merge-tree (LSM-tree). Acta Informatica. 1996.
  • 29. LSM Read Path + Bloom Filters
  • 32. Cluster Management Clients need to know where to find data (consistent hashing tokens, etc) Internal nodes may need to find each other as well Since nodes may fail and recover, a configuration file doesn’t really suffice We need a way of keeping some kind of consistent view of the cluster state
  • 33. Omniscient Master When nodes join/leave or change state, they talk to a master That master holds the authoritative view of the world Pros: simplicity, single consistent view of the cluster Cons: potential SPOF unless master is made highly available. Not partition-tolerant.
  • 34. Gossip Gossip is one method to propagate a view of cluster status. Every t seconds, on each node: The node selects some other node to chat with. The node reconciles its view of the cluster with its gossip buddy. Each node maintains a “timestamp” for itself and for the most recent information it has from every other node. Information about cluster state spreads in O(lgn) rounds (eventual consistency) Scalable and no SPOF, but state is only eventually consistent
  • 40. Questions to Ask Presenters
  • 41. Scalability and Reliability What are the scaling bottlenecks? How does it react when overloaded? Are there any single points of failure? When nodes fail, does the system maintain availability of all data? Does the system automatically re-replicate when replicas are lost? When new nodes are added, does the system automatically rebalance data?
  • 42. Performance What’s the goal? Batch throughput or request latency? How many seeks for reads? For writes? How many net RTTs? What 99th percentile latencies have been measured in practice? How do failures impact serving latencies? What throughput has been measured in practice for bulk loads?
  • 43. Consistency What consistency model does the system provide? What situations would cause a lapse of consistency, if any? Can consistency semantics be tweaked by configuration settings? Is there a way to do compare-and-swap on row contents for optimistic locking? Multirow?
  • 44. Cluster Management and Topology Does the system have a single master? Does it use gossip to spread cluster management data? Can it withstand network partitions and still provide some level of service? Can it be deployed across multiple datacenters for disaster recovery? Can nodes be commissioned/decomissioned automatically without downtime? Operational hooks for monitoring and metrics?
  • 45. Data Model and Storage What data model and storage system does the system provide? Is it pluggable? What IO patterns does the system cause under different workloads? Is the system best at random or sequential access? For read-mostly or write-mostly? Are there practical limits on key, value, or row sizes? Is compression available?
  • 46. Data Access Methods What methods exist for accessing data? Can I access it from language X? Is there a way to perform filtering or selection at the server side? Are there bulk load tools to get data in/out efficiently? Is there a provision for data backup/restore?
  • 47. Real Life Considerations (I was talking about fake life in the first 45 slides) Who uses this system? How big are the clusters it’s deployed on, and what kind of load do they handle? Who develops this system? Is this a community project or run by a single organization? Are outside contributions regularly accepted? Who supports this system? Is there an active community who will help me deploy it and debug issues? Docs? What is the open source license? What is the development roadmap?