NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, such as the situations in which they work better than a relational database and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification of the different architectural categories, clarifying the basic concepts and the terminology, and will provide a comparison of the features, strengths, and drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
This presentation is about NoSQL, which means "Not Only SQL." It covers the use of NoSQL for Big Data and the differences from an RDBMS.
Building robust CDC pipeline with Apache Hudi and Debezium (Tathastu.ai)
We will cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry, and Debezium in detail, along with our contributions to the open-source community.
Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino Summit 2021:
An overview of the Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support with the power of Apache Pinot's real-time analytics, giving you the best of both worlds.
This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.
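As a concrete illustration of the state-partitioning pattern mentioned above, here is a minimal hash-based sharding sketch in Python (the shard count and key names are arbitrary examples, not taken from the talk):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard deterministically via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Route each user record to one of four database shards.
shards = {i: {} for i in range(4)}
for user_id in ["alice", "bob", "carol", "dave"]:
    shards[shard_for(user_id, 4)][user_id] = {"id": user_id}
```

Because the mapping is deterministic, any node can route a request for a given key to the right shard without coordination; real systems typically layer consistent hashing on top so that resizing the cluster moves fewer keys.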
Netflix’s Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots.
In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg
Speaker
Ryan Blue, Software Engineer, Netflix
Doug Bateman, a principal data engineering instructor at Databricks, presented on how to build a Lakehouse architecture. He began by introducing himself and his background. He then discussed the session's goals: describing key Lakehouse features, explaining how Delta Lake enables them, and developing a sample Lakehouse using Databricks. The key aspects of a Lakehouse are that it supports diverse data types and workloads while enabling the use of BI tools directly on source data. Delta Lake provides reliability, consistency, and performance through its ACID transactions, automatic file consolidation, and integration with Spark. Bateman concluded with a demo of creating a Lakehouse.
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
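As a rough intuition for how a transaction protocol of this kind can work, here is a toy model (not Delta's actual format): each commit is an ordered JSON file in a log directory listing data files added and removed, and readers replay the log to reconstruct a consistent snapshot of the table.

```python
import json
import os
import tempfile

class ToyLog:
    """Toy append-only commit log, loosely inspired by Delta's design."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def commit(self, add=(), remove=()):
        # Each commit gets the next version number as its file name,
        # so the log defines a total order of table versions.
        version = len(os.listdir(self.path))
        entry = {"add": list(add), "remove": list(remove)}
        with open(os.path.join(self.path, f"{version:020d}.json"), "w") as f:
            json.dump(entry, f)

    def snapshot(self):
        # Replay every commit in order to compute the current file set.
        files = set()
        for name in sorted(os.listdir(self.path)):
            with open(os.path.join(self.path, name)) as f:
                entry = json.load(f)
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return files

log = ToyLog(tempfile.mkdtemp())
log.commit(add=["part-00000.parquet"])
log.commit(add=["part-00001.parquet"], remove=["part-00000.parquet"])
```

The real protocol adds checkpoints, optimistic concurrency control, and schema metadata, but the core idea is the same: the log, not the directory listing, is the source of truth for what is in the table.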
NoSQL stands for “not only SQL.”
NoSQL databases are databases that store data in a format other than relational tables.
NoSQL databases, or non-relational databases, don’t store relationship data well.
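To make the "format other than relational tables" point concrete, here is a hypothetical contrast between normalized relational rows and a self-contained document, sketched in plain Python (the field names are invented for illustration):

```python
import json

# Relational style: normalized rows linked by a foreign key,
# joined at query time.
users = [{"id": 1, "name": "Ada"}]
orders = [{"id": 10, "user_id": 1, "item": "book"}]
joined = [o for o in orders if o["user_id"] == users[0]["id"]]

# Document style: one self-contained JSON document per aggregate,
# with related data nested instead of joined.
user_doc = {
    "id": 1,
    "name": "Ada",
    "orders": [{"id": 10, "item": "book"}],
}

# No join needed: the document carries its own related data.
items = [o["item"] for o in user_doc["orders"]]
```

The nesting is exactly why document stores read whole aggregates fast but handle cross-document relationships poorly: the relationship is baked into one document rather than expressed as data both sides can query.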
Snowflake is an analytic data warehouse provided as software-as-a-service (SaaS). It uses a unique architecture designed for the cloud that combines elements of shared-disk and shared-nothing designs. Snowflake's architecture consists of three layers - the database storage layer, the query processing layer, and the cloud services layer - which are deployed and managed entirely on cloud platforms like AWS and Azure. Snowflake offers different editions like Standard, Premier, Enterprise, and Enterprise for Sensitive Data that provide additional features, support, and security capabilities.
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
This document provides an introduction to NoSQL databases. It discusses the history and limitations of relational databases that led to the development of NoSQL databases. The key motivations for NoSQL databases are that they can handle big data and provide better scalability and flexibility than relational databases. The document describes some core NoSQL concepts like the CAP theorem and different types of NoSQL databases such as key-value, columnar, document, and graph databases. It also outlines some remaining research challenges in the area of NoSQL databases.
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg (Anant Corporation)
In this talk, Dremio Developer Advocate, Alex Merced, discusses strategies for migrating your existing data over to Apache Iceberg. He'll go over the following:
How to Migrate Hive, Delta Lake, JSON, and CSV sources to Apache Iceberg
Pros and Cons of an In-place or Shadow Migration
Migrating between Apache Iceberg catalogs (e.g., Hive/Glue to Arctic/Nessie)
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, whose results can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
From cache to in-memory data grid. Introduction to Hazelcast. (Taras Matyashovsky)
This presentation:
* covers basics of caching and popular cache types
* explains evolution from simple cache to distributed, and from distributed to IMDG
* does not describe the use of NoSQL solutions for caching
* is not intended as a product comparison or a promotion of Hazelcast as the best solution
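As a minimal illustration of the "simple cache" starting point that such talks build from (a sketch, not Hazelcast's API), here is a small LRU cache: bounded in size, with the least-recently-used entry evicted first.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache built on OrderedDict's insertion order."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a" so "b" becomes least recently used
cache.put("c", 3)  # capacity exceeded: evicts "b"
```

The step from here to a distributed cache or IMDG is essentially moving this map off-process, partitioning it across nodes, and adding replication and compute near the data.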
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, auto loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
ETL extracts raw data from sources, transforms it on a separate server, and loads it into a target database. ELT loads raw data directly into a data warehouse, where data cleansing, enrichment, and transformations occur. While ETL has been used longer and has more supporting tools, ELT allows for faster queries, greater flexibility, and takes advantage of cloud data warehouse capabilities by performing transformations within the warehouse. However, ELT can present greater security risks and increased latency compared to ETL.
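The ETL/ELT distinction above can be sketched in a few lines, with SQLite standing in for the warehouse (a toy example; real pipelines use dedicated tools, and the table and column names are invented):

```python
import sqlite3

# Messy source data: inconsistent casing, ages stored as padded strings.
raw = [("alice", " 42 "), ("bob", "17")]

# ETL: transform outside the warehouse, then load clean rows.
transformed = [(name.title(), int(age.strip())) for name, age in raw]
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE users (name TEXT, age INTEGER)")
etl_db.executemany("INSERT INTO users VALUES (?, ?)", transformed)

# ELT: load raw data as-is, then transform inside the warehouse with SQL.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_users (name TEXT, age TEXT)")
elt_db.executemany("INSERT INTO raw_users VALUES (?, ?)", raw)
elt_db.execute("""
    CREATE TABLE users AS
    SELECT upper(substr(name, 1, 1)) || substr(name, 2) AS name,
           CAST(trim(age) AS INTEGER) AS age
    FROM raw_users
""")
```

Both end with the same clean table; the difference is where the transformation runs, which is exactly the trade-off the paragraph above describes: the warehouse's compute and the raw data's exposure move together.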
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
Kappa vs Lambda Architectures and Technology Comparison (Kai Wähner)
Real-time data beats slow data. That’s true for almost every use case. Nevertheless, enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers.
This video explores why a single real-time pipeline, called Kappa architecture, is the better fit for many enterprise architectures. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing fits into this discussion positively without the need for a Lambda architecture.
The main focus of the discussion is on Apache Kafka (and its ecosystem) as the de facto standard for event streaming to process data in motion (the key concept of Kappa), but the video also compares various technologies and vendors such as Confluent, Cloudera, IBM, Red Hat, Apache Flink, Apache Pulsar, AWS Kinesis, Amazon MSK, Azure Event Hubs, Google Pub/Sub, and more.
Video recording of this presentation:
https://youtu.be/j7D29eyysDw
Further reading:
https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/
https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/
https://www.kai-waehner.de/blog/2021/05/09/kafka-api-de-facto-standard-event-streaming-like-amazon-s3-object-storage/
This document discusses NoSQL and the CAP theorem. It begins with an introduction of the presenter and an overview of the topics to be covered: what NoSQL is and what the CAP theorem says. It then defines NoSQL, provides examples of the major NoSQL categories (document, graph, key-value, and wide-column stores), and explains why NoSQL is used, including to handle large, dynamic, and distributed data. The document also explains the CAP theorem, which states that a distributed data store can only satisfy two of three properties: consistency, availability, and partition tolerance. It provides examples of how to choose availability over consistency or vice versa. Finally, it concludes that both SQL and NoSQL have valid use cases and that a combination of the two is often the best approach.
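The availability-versus-consistency choice during a partition can be caricatured in a few lines of Python (a deliberately simplified toy, not a real replication protocol): an AP replica keeps answering with possibly stale data, while a CP replica refuses to answer until the partition heals.

```python
class Replica:
    """Toy replica that picks one side of the CAP trade-off."""

    def __init__(self, prefer_availability: bool):
        self.prefer_availability = prefer_availability
        self.data = {}
        self.partitioned = False  # simulated network partition flag

    def replicate(self, key, value):
        self.data[key] = value

    def read(self, key):
        if self.partitioned and not self.prefer_availability:
            # CP behavior: better no answer than a possibly wrong one.
            raise TimeoutError("CP mode: refusing possibly stale read")
        # AP behavior: answer from local state, which may be stale.
        return self.data.get(key)

ap = Replica(prefer_availability=True)
cp = Replica(prefer_availability=False)
for r in (ap, cp):
    r.replicate("x", 1)
    r.partitioned = True
```

Neither choice is wrong in general; a shopping cart usually wants the AP behavior, a bank ledger the CP behavior, which is the "valid use cases for both" conclusion the document reaches.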
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Iceberg: A modern table format for big data (Strata NY 2018) (Ryan Blue)
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
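The snapshot-isolation idea behind the first bullet can be sketched as a toy model (not Iceberg's actual metadata format): each snapshot is an immutable set of data files, a writer builds a new snapshot and then atomically swaps the "current" pointer, and a reader that pinned the old snapshot keeps a consistent view with no locks and no directory listings.

```python
class ToyTable:
    """Toy snapshot-based table metadata with an atomic current pointer."""

    def __init__(self):
        self.snapshots = [frozenset()]  # immutable file sets, never edited
        self.current = 0

    def pin(self):
        # A reader pins the current snapshot id for its whole query.
        return self.current

    def files(self, snapshot_id):
        return self.snapshots[snapshot_id]

    def commit(self, add=(), remove=()):
        # Build a new immutable snapshot from the current one, then
        # publish it by swapping a single pointer.
        new = (self.snapshots[self.current] | frozenset(add)) - frozenset(remove)
        self.snapshots.append(new)
        self.current = len(self.snapshots) - 1

table = ToyTable()
table.commit(add=["a.parquet"])
pinned = table.pin()                                    # reader starts here
table.commit(add=["b.parquet"], remove=["a.parquet"])   # concurrent rewrite
```

Because snapshots are never mutated in place, the reader holding `pinned` still sees `a.parquet` even after the rewrite, which is exactly how reads stay isolated from writes.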
This document provides an overview and introduction to NoSQL databases. It begins with an agenda that explores key-value, document, column family, and graph databases. For each type, 1-2 specific databases are discussed in more detail, including their origins, features, and use cases. Key databases mentioned include Voldemort, CouchDB, MongoDB, HBase, Cassandra, and Neo4j. The document concludes with references for further reading on NoSQL databases and related topics.
Change Data Streaming Patterns for Microservices With Debezium (Confluent)
(Gunnar Morling, Red Hat) Kafka Summit SF 2018
Debezium (noun | de·be·zi·um | /dɪ:ˈbɪ:ziːəm/): secret sauce for change data capture (CDC), streaming changes from your datastore so that you can solve multiple challenges: synchronizing data between microservices, gradually extracting microservices from existing monoliths, maintaining different read models in CQRS-style architectures, updating caches and full-text indexes, and feeding operational data to your analytics tools.
Join this session to learn what CDC is about, how it can be implemented using Debezium, an open source CDC solution based on Apache Kafka, and how it can be utilized for your microservices. Find out how Debezium captures all the changes from datastores such as MySQL, PostgreSQL, and MongoDB, how to react to the change events in near real time, and how Debezium is designed to not compromise on data correctness and completeness even if things go wrong. In a live demo we’ll show how to set up a change data stream out of your application’s database without any code changes needed. You’ll see how to sink the change events into other databases and how to push data changes to your clients using WebSockets.
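Debezium change events carry the row state before and after the change plus an operation code ("c" create, "u" update, "d" delete, "r" snapshot read). A simplified consumer that applies such events to a local store might look like this (the event shape here is pared down from the real envelope, and the fields are an illustrative example):

```python
import json

# A pared-down, Debezium-style update event for a users table.
event_json = """
{"payload": {"op": "u",
             "before": {"id": 1, "email": "old@example.com"},
             "after":  {"id": 1, "email": "new@example.com"}}}
"""

def apply_event(store, event):
    """Apply one change event to a dict keyed by primary key."""
    payload = event["payload"]
    if payload["op"] in ("c", "u", "r"):
        row = payload["after"]
        store[row["id"]] = row
    elif payload["op"] == "d":
        store.pop(payload["before"]["id"], None)
    return store

store = apply_event({}, json.loads(event_json))
```

Replaying events like this in order is how a consumer maintains a read model, cache, or search index that mirrors the source table.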
This document discusses different types of distributed databases. It covers data models like relational, aggregate-oriented, key-value, and document models. It also discusses different distribution models like sharding and replication. Consistency models for distributed databases are explained including eventual consistency and the CAP theorem. Key-value stores are described in more detail as a simple but widely used data model with features like consistency, scaling, and suitable use cases. Specific key-value databases like Redis, Riak, and DynamoDB are mentioned.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... (Databricks)
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline so that it solves your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem automatically brings much clarity on ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
This document provides an overview of NoSQL databases and compares them to relational databases. It discusses the different types of NoSQL databases including key-value stores, document databases, wide column stores, and graph databases. It also covers some common concepts like eventual consistency, CAP theorem, and MapReduce. While NoSQL databases provide better scalability for massive datasets, relational databases offer more mature tools and strong consistency models.
QCon SF 2014 talk on Netflix Mantis, a stream processing system (Danny Yuan)
Justin and I gave this talk at QCon SF 2014 about Mantis, a stream processing system that features a reactive programming API, auto scaling, and stream locality.
Monitoring at scale - Intuitive dashboard design (Lorenzo Alberton)
At a certain scale, millions of events happen every second, and all of them are important to evaluate the health of the system. If not handled correctly, such a volume of information can overwhelm both the infrastructure that needs to support it and the people who have to make sense of thousands of signals and make decisions upon them, fast. By understanding how our rational mind works and how people process information, we can present data so it is more evident and intuitive. This talk will explain how to collect useful metrics and how to create the perfect monitoring dashboard to organise and display them, letting our intuition operate automatically and quickly, and saving attention and mental effort for activities that demand it.
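A first step toward "collecting useful metrics" at that volume is rolling raw events up into plot-ready aggregates before they reach the dashboard; for example, per-second counts (a minimal sketch with made-up timestamps):

```python
from collections import Counter

def per_second_counts(timestamps):
    """Bucket raw event timestamps (seconds since epoch, as floats)
    into per-second counts, the kind of pre-aggregation a dashboard
    plots instead of raw events."""
    return Counter(int(ts) for ts in timestamps)

events = [100.1, 100.7, 100.9, 101.2, 103.0]
rates = per_second_counts(events)
```

Real pipelines do the same thing with windowed aggregations in the stream processor, so the dashboard only ever sees a handful of numbers per second instead of millions of events.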
“not only SQL.”
NoSQL databases are databases store data in a format other than relational tables.
NoSQL databases or non-relational databases don’t store relationship data well.
Snowflake is an analytic data warehouse provided as software-as-a-service (SaaS). It uses a unique architecture designed for the cloud, with a shared-disk database and shared-nothing architecture. Snowflake's architecture consists of three layers - the database layer, query processing layer, and cloud services layer - which are deployed and managed entirely on cloud platforms like AWS and Azure. Snowflake offers different editions like Standard, Premier, Enterprise, and Enterprise for Sensitive Data that provide additional features, support, and security capabilities.
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Hige Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
This document provides an introduction to NoSQL databases. It discusses the history and limitations of relational databases that led to the development of NoSQL databases. The key motivations for NoSQL databases are that they can handle big data, provide better scalability and flexibility than relational databases. The document describes some core NoSQL concepts like the CAP theorem and different types of NoSQL databases like key-value, columnar, document and graph databases. It also outlines some remaining research challenges in the area of NoSQL databases.
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
In this talk, Dremio Developer Advocate, Alex Merced, discusses strategies for migrating your existing data over to Apache Iceberg. He'll go over the following:
How to Migrate Hive, Delta Lake, JSON, and CSV sources to Apache Iceberg
Pros and Cons of an In-place or Shadow Migration
Migrating between Apache Iceberg catalogs Hive/Glue -- Arctic/Nessie
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
This presentation:
* covers basics of caching and popular cache types
* explains evolution from simple cache to distributed, and from distributed to IMDG
* not describes usage of NoSQL solutions for caching
* is not intended for products comparison or for promotion of Hazelcast as the best solution
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, auto loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
ETL extracts raw data from sources, transforms it on a separate server, and loads it into a target database. ELT loads raw data directly into a data warehouse, where data cleansing, enrichment, and transformations occur. While ETL has been used longer and has more supporting tools, ELT allows for faster queries, greater flexibility, and takes advantage of cloud data warehouse capabilities by performing transformations within the warehouse. However, ELT can present greater security risks and increased latency compared to ETL.
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
Kappa vs Lambda Architectures and Technology ComparisonKai Wähner
Real-time data beats slow data. That’s true for almost every use case. Nevertheless, enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers.
This video explores why a single real-time pipeline, called Kappa architecture, is the better fit for many enterprise architectures. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing fits into this discussion positively without the need for a Lambda architecture.
The main focus of the discussion is on Apache Kafka (and its ecosystem) as the de facto standard for event streaming to process data in motion (the key concept of Kappa), but the video also compares various technologies and vendors such as Confluent, Cloudera, IBM Red Hat, Apache Flink, Apache Pulsar, AWS Kinesis, Amazon MSK, Azure Event Hubs, Google Pub Sub, and more.
Video recording of this presentation:
https://youtu.be/j7D29eyysDw
Further reading:
https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/
https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/
https://www.kai-waehner.de/blog/2021/05/09/kafka-api-de-facto-standard-event-streaming-like-amazon-s3-object-storage/
This document discusses NoSQL and the CAP theorem. It begins with an introduction of the presenter and an overview of topics to be covered: What is NoSQL and the CAP theorem. It then defines NoSQL, provides examples of major NoSQL categories (document, graph, key-value, and wide-column stores), and explains why NoSQL is used, including to handle large, dynamic, and distributed data. The document also explains the CAP theorem, which states that a distributed data store can only satisfy two of three properties: consistency, availability, and partition tolerance. It provides examples of how to choose availability over consistency or vice versa. Finally, it concludes that both SQL and NoSQL have valid use cases and a combination
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
This document provides an overview and introduction to NoSQL databases. It begins with an agenda that explores key-value, document, column family, and graph databases. For each type, 1-2 specific databases are discussed in more detail, including their origins, features, and use cases. Key databases mentioned include Voldemort, CouchDB, MongoDB, HBase, Cassandra, and Neo4j. The document concludes with references for further reading on NoSQL databases and related topics.
Change Data Streaming Patterns for Microservices With Debezium - confluent
(Gunnar Morling, RedHat) Kafka Summit SF 2018
Debezium (noun | de·be·zi·um | /dɪ:ˈbɪ:ziːəm/): secret sauce for change data capture (CDC) streaming changes from your datastore that enables you to solve multiple challenges: synchronizing data between microservices, gradually extracting microservices from existing monoliths, maintaining different read models in CQRS-style architectures, updating caches and full-text indexes and feeding operational data to your analytics tools
Join this session to learn what CDC is about, how it can be implemented using Debezium, an open source CDC solution based on Apache Kafka, and how it can be utilized for your microservices. Find out how Debezium captures all the changes from datastores such as MySQL, PostgreSQL and MongoDB, how to react to the change events in near real time, and how Debezium is designed to not compromise on data correctness and completeness even if things go wrong. In a live demo we’ll show how to set up a change data stream out of your application’s database without any code changes needed. You’ll see how to sink the change events into other databases and how to push data changes to your clients using WebSockets.
This document discusses different types of distributed databases. It covers data models like relational, aggregate-oriented, key-value, and document models. It also discusses different distribution models like sharding and replication. Consistency models for distributed databases are explained including eventual consistency and the CAP theorem. Key-value stores are described in more detail as a simple but widely used data model with features like consistency, scaling, and suitable use cases. Specific key-value databases like Redis, Riak, and DynamoDB are mentioned.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... - Databricks
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline so that it solves your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem automatically brings much clarity on ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
This document provides an overview of NoSQL databases and compares them to relational databases. It discusses the different types of NoSQL databases including key-value stores, document databases, wide column stores, and graph databases. It also covers some common concepts like eventual consistency, CAP theorem, and MapReduce. While NoSQL databases provide better scalability for massive datasets, relational databases offer more mature tools and strong consistency models.
QConSF 2014 talk on Netflix Mantis, a stream processing system - Danny Yuan
Justin and I gave this talk at QCon SF 2014 about Mantis, a stream processing system that features a reactive programming API, auto scaling, and stream locality.
Monitoring at scale - Intuitive dashboard design - Lorenzo Alberton
At a certain scale, millions of events happen every second, and all of them are important to evaluate the health of the system. If not handled correctly, such a volume of information can overwhelm both the infrastructure that needs to support it and the people who have to make sense of thousands of signals and make decisions upon them, fast. By understanding how our rational mind works and how people process information, we can present data so it's more evident and intuitive. This talk will explain how to collect useful metrics and how to create the perfect monitoring dashboard to organise and display them, letting our intuition operate automatically and quickly, and saving attention and mental effort for activities that demand it.
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees - Lorenzo Alberton
The first part of a series of talks about modern algorithms and data structures used by NoSQL databases like HBase and Cassandra: an explanation of Bloom Filters and several derivatives, and Merkle Trees.
Scalable Architectures - Taming the Twitter Firehose - Lorenzo Alberton
The document discusses scalable architectures for real-time platforms like Twitter. It covers using service-oriented architectures (SOAs) to scale individual components independently. Specific techniques discussed include using message queues like Redis and Kafka to decouple components and smooth load, caching with Varnish, load balancing with HAProxy, and distributing processing loads across worker processes using message patterns like ZeroMQ's PUSH-PULL. APIs and service discovery with Zookeeper are also covered. The overall goal is to scale all aspects of the platform.
The ability to grow (and shrink) according to the needs and the available resources is an essential part of designing applications. In this talk we'll cover the fundamental elements of scalability, including aspects involving people, processes and technology. With sound and proven principles and some advice on how to shape your organisation, set the right processes and design your application, this session is a must-see for developers and technical leads alike.
The document summarizes a meetup about NoSQL databases hosted by AWS in Sydney in 2012. It includes an agenda with presentations on Introduction to NoSQL and using EMR and DynamoDB. NoSQL is introduced as a class of databases that don't use SQL as the primary query language and are focused on scalability, availability and handling large volumes of data in real-time. Common NoSQL databases mentioned include DynamoDB, BigTable and document databases.
The document introduces MongoDB as an open source, high performance database that is a popular NoSQL option. It discusses how MongoDB stores data as JSON-like documents, supports dynamic schemas, and scales horizontally across commodity servers. MongoDB is seen as a good alternative to SQL databases for applications dealing with large volumes of diverse data that need to scale.
This document provides an overview of different database types including relational, NoSQL, document, key-value, graph, and column family databases. It discusses the history and drivers behind the development of NoSQL databases, as well as concepts like horizontal scaling, the CAP theorem, and eventual consistency. Specific databases are also summarized, including MongoDB, Redis, Neo4j, and HBase.
Graphs in the Database: RDBMS in the Social Networks Age - Lorenzo Alberton
Despite the NoSQL movement trying to flag traditional databases as a dying breed, the RDBMS keeps evolving and adding new powerful weapons to its arsenal. In this talk we'll explore Common Table Expressions (SQL-99) and how SQL handles recursion, breaking the bi-dimensional barriers and paving the way to more complex data structures like trees and graphs, and how we can replicate features from social networks and recommendation systems. We'll also have a look at window functions (SQL:2003) and the advanced reporting features they make finally possible.
NoSQL databases are currently used in several application scenarios, in contrast to relational databases. Several types of databases exist. In this presentation we compare key-value, column-oriented, document-oriented and graph databases. Using a simple case study, the pros and cons of the NoSQL databases under consideration are evaluated.
This document provides an overview and introduction to NoSQL databases. It discusses key-value stores like Dynamo and BigTable, which are distributed, scalable databases that sacrifice complex queries for availability and performance. It also explains column-oriented databases like Cassandra that scale to massive workloads. The document compares the CAP theorem and consistency models of these databases and provides examples of their architectures, data models, and operations.
Trees In The Database - Advanced data structures - Lorenzo Alberton
Storing tree structures in a bi-dimensional table has always been problematic. The simplest tree models are usually quite inefficient, while more complex ones aren't necessarily better. In this talk I briefly go through the most used models (adjacency list, materialized path, nested sets) and introduce some more advanced ones belonging to the nested intervals family (Farey algorithm, Continued Fractions, and other encodings). I describe the advantages and pitfalls of each model, some proprietary solutions (e.g. Oracle's CONNECT BY) and one of the SQL Standard's upcoming features, Common Table Expressions.
This document provides an overview of NoSQL databases. It begins by defining NoSQL as non-relational databases that are distributed, open source, and horizontally scalable. It then discusses some of the limitations of relational databases that led to the rise of NoSQL, such as issues with scalability and the need for flexible schemas. The document also summarizes some key NoSQL concepts, including the CAP theorem, ACID versus BASE, and eventual consistency. It provides examples of use cases for NoSQL databases and discusses some common NoSQL database types and how they address scalability.
Oracle Database regularly outperforms IBM DB2 on industry benchmarks due to technical differences in concurrency control, locking, indexing and partitioning capabilities. Oracle provides multi-version read consistency and non-escalating row-level locking, avoiding the performance penalties of DB2's read locks and lock escalation. Oracle also supports more indexing options like bitmap indexes that improve data warehousing performance.
The document describes an OpenFlow controller called Floodlight that is open source and written in Java, discusses how it works and some of its main components, and provides an overview of using OpenFlow and the Floodlight controller to build software-defined networks through examples of real world use cases.
This document provides an overview of NoSQL databases. It discusses that NoSQL databases are non-relational and were created to overcome limitations of scaling relational databases. The document categorizes NoSQL databases into key-value stores, document databases, graph databases, XML databases, and distributed peer stores. It provides examples like MongoDB, Redis, CouchDB, and Cassandra. The document also explains concepts like CAP theorem, ACID properties, and reasons for using NoSQL databases like horizontal scaling, schema flexibility, and handling large amounts of data.
This document discusses SQL Server 2012 AlwaysOn, a high availability and disaster recovery solution. It provides an overview of AlwaysOn availability groups, which allow for multiple synchronous or asynchronous copies of databases across instances. Key features include readable secondary replicas, automatic instance and database failover, and the ability to perform backups on secondary replicas. The document also demonstrates AlwaysOn configuration and functionality through a virtual machine-based lab environment.
A look at the changing development landscape and how we may have to rearchitect our Grails applications.
Also looks at existing, new, or potential Grails features that can help navigate this new world order.
The document discusses Samba Management Console (SMC), a project to create a graphical user interface for managing Samba servers. SMC aims to simplify Samba administration, provide a global view of multiple servers, and enable open integration with other systems through a REST API. The architecture uses Python, ExtJS, and a model-view-controller pattern. A demo interface was presented along with plans to improve integration, performance for large environments, and upgrade capabilities.
MongoDB should not be used if the data is highly relational and complex queries involving relations are needed, if multi-document transactions are required to be atomic, or if data consistency and durability are critical requirements. MongoDB sacrifices consistency and transactions for horizontal scalability and flexibility with unstructured data. It may be suitable for less critical data while a SQL database can be used for sensitive relational and transactional data.
Architecting for failure - Why are distributed systems hard? - Markus Eisele
Devnexus 2017
As we architect our systems for greater demands, scale, uptime, and performance, the hardest thing to control becomes the environment in which we deploy and the subtle but crucial interactions between complicated systems. And microservices obviously are the way to go forward with those complicated systems. But what makes it so hard to build them? And why should you embrace failure instead of doing what we can do best: Preventing failure. This talk introduces you to the problem domain of a distributed system which consists of a couple of microservices. It shows how to build, deploy and orchestrate the chaos and introduces you to a couple of patterns to prevent and compensate failure.
MongoDB Ops Manager is an enterprise-grade end-to-end database management, monitoring, and backup solution. Kubernetes has clearly won the orchestration-platform "wars". In this session we'll take a deep dive on how you can leverage both these technologies to host your MongoDB deployments within your Kubernetes infrastructure whether that's OpenShift, PKS, Azure AKS, or just upstream. This talk will review the core technologies, such as containers, Kubernetes, and MongoDB Ops Manager. You'll also have a chance to see real-live demos of MongoDB running on Kubernetes and managed with MongoDB Ops Manager with the MongoDB Enterprise Kubernetes Operator.
Fundamentals Of Transaction Systems - Part 2: Certainty suppresses Uncertaint... - Valverde Computing
The document discusses transaction systems and consistency models. It summarizes that:
- Brewer's CAP theorem states that distributed systems can only achieve two of consistency, availability, and partition tolerance.
- Many financial systems achieve all three by using private networks and 3-phase commit, challenging assumptions of the CAP theorem.
- Workflow systems can help achieve consistency across inconsistent distributed systems by driving them into acceptable states.
Clustered Architecture Patterns Delivering Scalability And Availability - ConSanFrancisco123
The document discusses different architecture patterns for delivering scalability and availability in clustered systems. It covers load-balanced and partitioned scale-out patterns, and how to balance simplicity, scalability, and availability. JVM-level clustering is presented as an approach that can address these patterns by sharing memory across JVMs in a transparent way.
Enterprise Java in 2012 and Beyond, by Juergen Hoeller - Codemotion
The Java space is facing several disruptive middleware trends. Key factors are the recent Java EE 6 and Java SE 7 platform releases, but also modern web clients, non-relational datastores and in particular cloud computing, all of which have a strong influence on the next generation of Java application frameworks. This session presents selected trends and explores their relevance for enterprise application development, taking the most recent Java SE and Java EE developments into account as well.
This document provides an introduction and overview of Oracle Database including:
1. Key features of Oracle Database such as supporting E.F. Codd's rules for relational databases and the Oracle Internet Platform.
2. The physical structures that make up an Oracle Database including data files, redo log files, and tablespaces.
3. The memory structures of an Oracle Instance including the system global area and process/background memory.
4. An overview of basic SQL statements for data retrieval, manipulation, and schema object management.
This configuration example shows an S-Series switch configured for access with a port channel connecting to an upstream aggregation switch. Key aspects include assigning interfaces to VLANs 1, 5, and 7; creating port channel 1 with interfaces 0/46-47 in active LACP mode; and disabling spanning tree on the port channel interfaces for the connection to the MLAG upstream.
Apache Jackrabbit Oak is a new JCR implementation with a completely new architecture. Based on concepts like eventual consistency and multi-version concurrency control, and borrowing ideas from distributed version control systems and cloud-scale databases, the Oak architecture is a major leap ahead for Jackrabbit. This presentation describes the Oak architecture and shows what it means for the scalability and performance of modern content applications. Changes to existing Jackrabbit functionality are described and the migration process is explained.
The View - Leveraging LotusScript for Database Connectivity - Bill Buchan
The document discusses using LotusScript (LS) to connect Lotus Notes databases to external relational databases. It covers LSX, LS:DO, and DCR as methods for database connectivity from LotusScript server-side agents. The presentation provides code examples and discusses object-oriented design patterns for database connectivity classes. It demonstrates connecting to an Access database using ODBC and connecting to an Oracle database from LotusScript. The document emphasizes best practices like error handling, logging, and separating database-specific code.
Most developers get started building applications without giving a lot of thought to their database. Either they get told that the company does everything with Database X, or they Google around and end up using MySQL or MS Access. Neither of these is the wrong choice, but it's often not the best choice. I was one of those guys, first with Access and then MySQL. As I've moved through my career, I've used a lot of database systems, both relational and non-relational ("NoSQL"), and my go-to choice has become PostgreSQL.
I'm not going to spend much time on the "SQL vs NoSQL" debate. It's sort of a straw man argument, because they're solving a variety of fundamentally different problems. It's also much better in discussion format, for that same reason: so much variety.
Like everything else in technology, there is no one-size-fits-all solution, but I want to show why I think it's a great first choice for most things, and the reasons why a lot of other options fall short. Your choice of database can shape a lot of how you build your application, how well it performs, and how it can grow over time. It can be a productivity boon, or force you to constantly work around its limitations. As application developers, we should be spending our time solving business problems, not on low-level technology plumbing.
This session is intended for people who have built a few database-backed applications, and are curious about what options are out there and how to go about choosing between them.
Similar to NoSQL Databases: Why, what and when (20)
19. A little theory
Fundamental Principles of (Distributed) Databases
http://www.timbarcz.com/blog/PassionInProgrammers.aspx
20. ACID
ATOMICITY: All or nothing.
CONSISTENCY: Any transaction will take the db from one consistent state to another, with no broken constraints (referential integrity).
ISOLATION: Other operations cannot access data that has been modified during a transaction that has not yet completed.
DURABILITY: Ability to recover the committed transaction updates against any kind of system failure (transaction log).
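As a concrete (if miniature) illustration of Atomicity and Consistency, here is a sketch using Python's built-in sqlite3 module; the table, names and amounts are invented for the example:

```python
# Illustrative only: a transfer either commits as a whole or rolls back
# (Atomicity), and a CHECK constraint blocks inconsistent states (Consistency).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        return True
    except sqlite3.IntegrityError:  # CHECK violated: the whole transfer is undone
        return False

assert transfer(conn, "alice", "bob", 60)      # ok
assert not transfer(conn, "alice", "bob", 60)  # would overdraw alice: rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 40, 'bob': 60}
```

Note the failed second transfer leaves no trace: neither UPDATE survives the rollback.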
21-25. Isolation Levels, Locking & MVCC
Isolation (noun): property that defines how/when the changes made by one operation become visible to other concurrent operations.
SERIALIZABLE: all transactions occur in a completely isolated fashion, as if they were executed serially.
REPEATABLE READ: multiple SELECT statements issued in the same transaction will always yield the same result.
READ COMMITTED: a lock is acquired only on the rows currently read/updated.
READ UNCOMMITTED: a transaction can access uncommitted changes made by other transactions.
28-32. Multi-Version Concurrency Control
[Diagram: a copy-on-write index tree (root, index pages, data pages). A write creates a new version of the affected pages and of the root; the switch to the new version is an atomic pointer update, and the obsolete pages are marked for compaction. Reads are never blocked.]
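The copy-on-write idea behind MVCC can be sketched with a toy persistent binary search tree in Python (entirely illustrative; real engines version disk pages, not Python objects):

```python
# A write rebuilds only the nodes on the path to the change and returns a new
# root; readers holding the old root keep a stable, never-blocked snapshot.
class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root, key, value):
    """Return a NEW root; untouched subtrees are shared between versions."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # new version of this entry

def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

v1 = insert(insert(None, "a", 1), "b", 2)  # snapshot 1
v2 = insert(v1, "a", 99)                   # snapshot 2: new version of "a"
print(get(v1, "a"), get(v2, "a"))  # 1 99 -- the old snapshot is unchanged
```

Swapping which root a reader sees is a single pointer assignment, which is the "atomic pointer update" in the diagram; old versions become garbage ("marked for compaction") once no reader holds them.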
45-46. Distributed Transactions - 2PC
[Diagram: coordinator and participants in the 2) COMMIT phase (completion phase). On success the participants acknowledge; on b) FAILURE (an abort from any participant) the coordinator tells every participant to undo the transaction.]
47. Problems with 2PC
Blocking protocol: risk of indefinite cohort blocks if the coordinator fails.
Conservative behaviour: biased to the abort case.
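The protocol in the diagrams can be sketched as a single-process simulation (class and method names are mine, and a participant "failure" is modelled simply as a no vote in the prepare phase):

```python
# Two-phase commit: phase 1 asks every participant to PREPARE and vote;
# the coordinator commits only if all vote yes, otherwise all must undo.
class Participant:
    def __init__(self, name, will_fail=False):
        self.name, self.will_fail, self.state = name, will_fail, "idle"
    def prepare(self):  # phase 1: vote yes/no
        self.state = "prepared"
        return not self.will_fail
    def commit(self):   # phase 2a: completion
        self.state = "committed"
    def abort(self):    # phase 2b: undo transaction
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]  # 1) PREPARE phase
    if all(votes):                               # 2) COMMIT phase
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:                       # abort from any -> undo all
        p.abort()
    return "aborted"

ok = two_phase_commit([Participant("db1"), Participant("db2")])
bad = two_phase_commit([Participant("db1"), Participant("db2", will_fail=True)])
print(ok, bad)  # committed aborted
```

What this toy cannot show is exactly the blocking problem above: a prepared participant that loses contact with the coordinator must hold its locks and wait, since it cannot decide unilaterally.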
48. Paxos Algorithm (Consensus)
Family of fault-tolerant, distributed implementations.
Spectrum of trade-offs:
Number of processors
Number of message delays
Activity level of participants
Number of messages sent
Types of failures
http://www.usenix.org/event/nsdi09/tech/full_papers/yabandeh/yabandeh_html/
http://en.wikipedia.org/wiki/Paxos_algorithm
50-51. ACID & Distributed Systems
ACID properties are always desirable.
But what about:
Latency
Partition Tolerance
High Availability
?
http://images.tribe.net/tribe/upload/photo/deb/074/deb074db-81fc-4b8a-bfbd-b18b922885cb
52-53. CAP Theorem (Brewer’s conjecture)
2000: Prof. Eric Brewer, PODC Conference Keynote
2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)
Of three properties of shared-data systems - data Consistency, system Availability and tolerance to network Partitions - only two can be achieved at any given moment in time.
http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
54-57. Partition Tolerance - Availability
“The network will be allowed to lose arbitrarily many messages sent from one node to another” [...]
“For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” - Gilbert and Lynch, SIGACT 2002
CP: requests can complete at nodes that have quorum.
AP: requests can complete at any live node, possibly violating strong consistency.
HIGH LATENCY ≈ NETWORK PARTITION
http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
http://codahale.com/you-cant-sacrifice-partition-tolerance
http://pl.atyp.us/wordpress/?p=2521
58-59. Consistency: Client-side view
A service that is consistent operates fully or not at all.
Strong consistency (as in ACID)
Weak consistency (no guarantee) - inconsistency window
Eventual* consistency (e.g. DNS)
Causal consistency
Read-your-writes consistency (the least surprise)
Session consistency
Monotonic read consistency
Monotonic write consistency
(*) Temporary inconsistencies (e.g. in data constraints or replica versions) are accepted, but they’re resolved at the earliest opportunity.
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
60. Consistency: Server-side (Quorum)
N = number of nodes with a replica of the data (*)
W = number of replicas that must acknowledge the update
R = minimum number of replicas that must participate in a successful read operation
(*) but the data will be written to N nodes no matter what
W + R > N: strong consistency (usually N=3, W=R=2)
W = N, R = 1: optimised for reads
W = 1, R = N: optimised for writes (durability not guaranteed in presence of failures)
W + R <= N: weak consistency
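The quorum rule is easy to check mechanically; the helper below (function name mine) just applies the W + R > N test to the configurations listed above:

```python
# W + R > N guarantees every read quorum overlaps every write quorum,
# so at least one replica in any read already holds the latest write.
def classify(n, w, r):
    return "strong" if w + r > n else "weak"

configs = [
    (3, 2, 2),  # usual strong setup: N=3, W=R=2
    (3, 3, 1),  # W=N, R=1: optimised for reads
    (3, 1, 3),  # W=1, R=N: optimised for writes (durability at risk on failures)
    (3, 1, 1),  # W + R <= N: a read may miss the most recent write
]
for n, w, r in configs:
    print(f"N={n} W={w} R={r}: {classify(n, w, r)}")
```

The overlap argument is pigeonhole: any W acknowledged replicas and any R read replicas are drawn from the same N nodes, so with W + R > N they must share at least one node.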
61. Amazon Dynamo Paper
Consistent Hashing
Vector Clocks
Gossip Protocol
Hinted Handoffs
Read Repair
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
66. Modulo-based Hashing
N1 N2 N3 N4
partition = key % n_servers (partitions numbered 0 .. n_servers - 1)
Recalculate the hashes for all the entries if n_servers changes (i.e. full data redistribution when adding/removing a node).
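A quick back-of-the-envelope illustration (key counts and server counts mine) of that full redistribution:

```python
# Going from 4 to 5 servers with modulo placement changes the partition of
# almost every key, i.e. nearly all data would have to move.
keys = list(range(10000))
before = {k: k % 4 for k in keys}  # partition = key % n_servers, n_servers = 4
after = {k: k % 5 for k in keys}   # one server added
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/{len(keys)} keys changed partition")  # the vast majority move
```

(Only keys whose residues happen to agree mod 4 and mod 5, i.e. 4 residues out of every 20, stay put.)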
67-72. Consistent Hashing
Ring (key space): 0 .. 2^160
Same hash function for data and nodes: idx = hash(key)
Coordinator: the next available clockwise node is the canonical home (coordinator node) for a key range (e.g. B for key range A-B; if B leaves, C becomes the canonical home for key range A-C).
When a node joins or leaves, only the keys in the affected range change location.
http://en.wikipedia.org/wiki/Consistent_hashing
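The ring can be sketched in a few lines of Python (a sketch under my own choices: md5 as the hash function, bisect for the clockwise-successor lookup, 8 virtual-node tokens per server, invented node and key names):

```python
# Consistent hashing: nodes and keys share one hash space; a key belongs to
# the next token clockwise. Adding a node relocates only a fraction of keys.
import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes=(), vnodes=8):
        self.vnodes, self.tokens, self.owners = vnodes, [], {}
        for n in nodes:
            self.add(n)
    def add(self, node):
        for i in range(self.vnodes):  # virtual nodes: several tokens per physical node
            t = h(f"{node}#{i}")
            bisect.insort(self.tokens, t)
            self.owners[t] = node
    def lookup(self, key):
        """Coordinator = next token clockwise from hash(key)."""
        i = bisect.bisect(self.tokens, h(key)) % len(self.tokens)
        return self.owners[self.tokens[i]]

keys = [f"user:{i}" for i in range(1000)]
ring = Ring(["A", "B", "C"])
before = {k: ring.lookup(k) for k in keys}
ring.add("D")
moved = sum(1 for k in keys if ring.lookup(k) != before[k])
print(f"{moved}/1000 keys moved")  # roughly a quarter of the keys, not all of them
```

Contrast this with the modulo scheme on the previous slide, where the same change would relocate most of the data.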
74. Consistent Hashing - Replication
Data replicated in the N-1 clockwise successor nodes
Key_AB is hosted in B, C, D (for N=3)
Node C hosts Key_FA, Key_AB, Key_BC
http://horicky.blogspot.com/2009/11/nosql-patterns.html
76. Consistent Hashing - Node Changes
Key membership and replicas are updated when a node joins or leaves the network.
The number of replicas for all data is kept consistent.
(here: Key Range AB, Key Range FA and Key Range EF are copied to their new replica nodes)
77. Consistent Hashing - Load Distribution
Different Strategies: Virtual Nodes
Random tokens per each physical node, partition by token value
Ring (key space) 0 → 2^160 with tokens A-I:
Node 1: tokens A, E, G
Node 2: tokens C, F, H
Node 3: tokens B, D, I
78. Consistent Hashing - Load Distribution
Different Strategies: Virtual Nodes
Q equal-sized partitions, S nodes, Q/S tokens per node (with Q >> S)
Partitions assigned across Node 1, Node 2, Node 3, Node 4, ...
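The Q/S token strategy can be sketched in a few lines. The round-robin assignment below is a simplification — real systems randomise token placement and rebalance on membership changes:

```python
from collections import Counter

def assign_tokens(q: int, nodes: list[str]) -> dict[int, str]:
    """Q equal-sized partitions, S nodes, Q/S tokens per node (Q >> S)."""
    s = len(nodes)
    assert q % s == 0, "sketch assumes Q divisible by S"
    # Round-robin token-to-node assignment for illustration.
    return {token: nodes[token % s] for token in range(q)}

owners = assign_tokens(16, ["Node1", "Node2", "Node3", "Node4"])
load = Counter(owners.values())
assert set(load.values()) == {4}  # every node owns exactly Q/S = 4 partitions
```

Because partitions are fixed and equal-sized, moving load between nodes means reassigning whole tokens rather than re-hashing keys.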
79. Vector Clocks & Conflict Detection
Nodes: A, B, C
write handled by A → D1 ([A, 1])
Causality-based partial order over events that happen in the system.
Document version history: a counter for each node that updated the document.
If all update counters in V1 are smaller than or equal to all update counters in V2, then V1 precedes V2.
http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601
80-83. Vector Clocks & Conflict Detection
write handled by A → D1 ([A, 1])
write handled by A → D2 ([A, 2])
write handled by B → D3 ([A, 2], [B, 1])    write handled by C → D4 ([A, 2], [C, 1])
conflict detected, reconciliation handled by A → D5 ([A, 3], [B, 1], [C, 1])
http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601
84. Vector Clocks & Conflict Detection
Vector Clocks can detect a conflict. The conflict resolution is left to the application or the user.
The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes).
Vector clocks can grow quite large (!)
http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601
85-87. Vector Clocks & Conflict Detection
write handled by A → D1 ([A, 1])
write handled by A → D2 ([A, 2])
write handled by B → D3 ([A, 2], [B, 1])    un-modified replica → D4 ([A, 2])
version mismatch detected; D3 ⊇ D4, conflict resolved automatically → D5 ([A, 3], [B, 1])
http://en.wikipedia.org/wiki/Vector_clock http://pl.atyp.us/wordpress/?p=2601
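The dominance test and reconciliation shown on these slides fit in a few functions. Clocks are modelled as plain dicts of node → counter; the function names are illustrative:

```python
def descends(v2: dict, v1: dict) -> bool:
    """True if all update counters in v1 are <= those in v2 (v1 precedes v2)."""
    return all(v2.get(node, 0) >= count for node, count in v1.items())

def in_conflict(v1: dict, v2: dict) -> bool:
    """Concurrent versions: neither clock dominates the other."""
    return not descends(v1, v2) and not descends(v2, v1)

def reconcile(node: str, *clocks: dict) -> dict:
    """Element-wise max of the clocks, then bump the reconciling node."""
    merged = {}
    for clock in clocks:
        for n, count in clock.items():
            merged[n] = max(merged.get(n, 0), count)
    merged[node] = merged.get(node, 0) + 1
    return merged

# Replaying the slides' scenarios:
d2 = {"A": 2}
d3 = {"A": 2, "B": 1}
d4 = {"A": 2, "C": 1}
assert descends(d3, d2)       # D3 supersedes D2: no conflict
assert in_conflict(d3, d4)    # concurrent writes on B and C: conflict detected
assert reconcile("A", d3, d4) == {"A": 3, "B": 1, "C": 1}  # D5
assert not in_conflict(d3, {"A": 2})  # D3 ⊇ D4: resolved automatically
```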
88. Gossip Protocol + Hinted Handoff
Ring of nodes A-F.
Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers
89. Gossip Protocol + Hinted Handoff
"I can't see B; it might be down, but I need some ACK. My Merkle Tree root for range XY is 'ab031dab4a385afda'."
"I can't see B either. My Merkle Tree root for range XY is different!"
"B must be down then. Let's disable it."
90. Gossip Protocol + Hinted Handoff
"My canonical node is supposed to be B."
"I see. Well, I'll take care of it for now, and let B know when B is available again." (hinted handoff)
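The "who can see B?" chatter above is a failure-detection state being gossiped around the ring. The sketch below uses deterministic clockwise-neighbour pairing in place of the random peer selection the slide describes, purely so the example is reproducible:

```python
def gossip_round(order: list[str], views: dict[str, set[str]]) -> None:
    """One round: each node exchanges its suspected-down set with its
    clockwise neighbour and both keep the merged view (pairwise, bounded)."""
    for i, node in enumerate(order):
        peer = order[(i + 1) % len(order)]
        merged = views[node] | views[peer]
        views[node] = views[peer] = merged

# E and F both failed to reach B; A, C, D don't know yet.
order = ["A", "C", "D", "E", "F"]
views = {"A": set(), "C": set(), "D": set(), "E": {"B"}, "F": {"B"}}
for _ in range(2):
    gossip_round(order, views)
# The suspicion about B has spread to every live node.
assert all("B" in suspected for suspected in views.values())
```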
91. Merkle Trees (Hash Trees)
Leaves: hashes of data blocks.
Nodes: hashes of their children.
    ROOT = hash(A, B)
    A = hash(C, D)    B = hash(E, F)
    C = hash(001)  D = hash(002)  E = hash(003)  F = hash(004)
    Data Blocks: 001, 002, 003, 004
Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data.
http://en.wikipedia.org/wiki/Hash_tree
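The tree above can be built bottom-up in a few lines. This sketch assumes a power-of-two number of leaves and uses SHA-256; the block contents are illustrative:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list[bytes]) -> bytes:
    """Leaves hash the data blocks; each inner node hashes its two children."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_1 = [b"001", b"002", b"003", b"004"]
replica_2 = [b"001", b"002", b"XXX", b"004"]  # one corrupted block
assert merkle_root(replica_1) != merkle_root(replica_2)  # divergence detected
assert merkle_root(replica_1) == merkle_root([b"001", b"002", b"003", b"004"])
```

Comparing roots first, then descending only into mismatching subtrees, is what lets anti-entropy transfer only the differing blocks rather than the whole range.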
103. Voldemort AP
Dynamo DHT implementation
Consistent hashing, Vector clocks
Simple optimistic locking for multi-row updates, pluggable storage engine
LICENSE: Apache 2
LANGUAGE: Java
API/PROTOCOL: HTTP, Java, Thrift, Avro, Protobuf
PERSISTENCE: Pluggable (BDB/MySQL)
CONCURRENCY: MVCC
107. Membase CP
DHT (K-V), no SPoF
LICENSE: Apache 2
LANGUAGE: C/C++, Erlang
API/PROTOCOL: REST/JSON, memcached
membase layer: persistence, replication (fail-over HA), rebalancing
memcached layer: distributed in-memory cache
"VBuckets": unit of consistency and replication; owner of a subset of the cluster key space (hash function + table lookup)
All metadata kept in memory (high throughput / low latency).
Manual/Programmatic failover via the Management REST API.
http://dustin.github.com/2010/06/29/memcached-vbuckets.html
109. Redis CP
K-V store "Data Structures Server"
Map, Set, Sorted Set, Linked List
Set/Queue operations, Counters, Pub-Sub, Volatile keys
10-100K op/s (whole dataset in RAM + VM)
Persistence via snapshotting (tunable fsync freq.)
Distributed if client supports consistent hashing
LICENSE: BSD
LANGUAGE: ANSI C
API/PROTOCOL: Telnet-like
PERSISTENCE: in memory, bg snapshots
REPLICATION: master-slave
http://redis.io/presentation/Redis_Cluster.pdf
110. 2) Column Families
Google BigTable paper
Data model: big table, column families
111. Google BigTable Paper
Sparse, distributed, persistent multi-dimensional sorted map
indexed by (row_key, column_key, timestamp)
Row "com.cnn.www": column "contents:html" = <html>..., column "anchor:cnnsi.com" = "CNN", column "anchor:my.look.ca" = "CNN.com"
http://labs.google.com/papers/bigtable-osdi06.pdf
114-118. Google BigTable Paper
Sparse, distributed, persistent multi-dimensional sorted map
indexed by (row_key, column_key, timestamp)
Row "com.cnn.www":
    "contents:html" → <html>... at t3, t5, t6 (multiple timestamped versions)
    "anchor:cnnsi.com" → "CNN" at t9
    "anchor:my.look.ca" → "CNN.com" at t8
Columns are grouped into column families (here "anchor:"); ACLs apply at the column-family level.
Atomic updates; automatic GC of old versions.
http://labs.google.com/papers/bigtable-osdi06.pdf
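The (row_key, column_key, timestamp) → value indexing above can be mimicked with nested dicts. This only illustrates the data model — a real BigTable keeps rows sorted and distributed across tablets:

```python
from collections import defaultdict

# table[row_key][column_key][timestamp] = value
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, column_key, timestamp, value):
    table[row_key][column_key][timestamp] = value

def get(row_key, column_key):
    """Return the most recent version of a cell (highest timestamp)."""
    versions = table[row_key][column_key]
    return versions[max(versions)]

put("com.cnn.www", "contents:html", 3, "<html>... (t3)")
put("com.cnn.www", "contents:html", 5, "<html>... (t5)")
put("com.cnn.www", "contents:html", 6, "<html>... (t6)")
put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
put("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com")

assert get("com.cnn.www", "contents:html") == "<html>... (t6)"
assert get("com.cnn.www", "anchor:cnnsi.com") == "CNN"
```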
119. Google BigTable: Data Structure
SSTable
Smallest building block
Persistent immutable Map[k,v]
Operations: lookup by key / key range scan
Layout: 64KB blocks + a lookup index
120. Google BigTable: Data Structure
Tablet
Dynamically partitioned range of rows
Built from multiple SSTables
Unit of distribution and load balancing
Example: Tablet (range Aaa → Bar) spanning two SSTables
121. Google BigTable: Data Structure
Table
Multiple Tablets (table segments) make up a table
123-125. Google BigTable: I/O
write → tablet log (on GFS)
memtable (in memory) → minor compaction → SSTables (on GFS)
read ← merged view of memtable + SSTables
SSTable compression: BMDiff, Zippy
merging / major compaction (GC)
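The write path above (log first, then memtable, flushed to immutable SSTables by minor compaction) can be sketched as a toy class. `MiniLSM` and its size threshold are illustrative, not BigTable's actual mechanics:

```python
class MiniLSM:
    """Toy sketch of the BigTable write path: log → memtable → SSTable flush."""

    def __init__(self, memtable_limit: int = 3):
        self.log = []        # tablet log (would live on GFS)
        self.memtable = {}   # in-memory buffer of recent writes
        self.sstables = []   # immutable flushed maps, oldest first
        self.limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.minor_compaction()

    def minor_compaction(self):
        # Freeze the memtable as a sorted, immutable SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:  # newest data wins
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None

db = MiniLSM()
for i in range(5):
    db.write(f"k{i}", i)
assert db.read("k0") == 0      # served from a flushed SSTable
assert db.read("k4") == 4      # served from the memtable
assert len(db.sstables) == 1   # one minor compaction after 3 writes
```

Major compaction would merge the accumulated SSTables back into one, discarding overwritten and deleted entries — the GC step on the slide.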
126. Google BigTable: Location Dereferencing
Chubby Master File → Root Tablet → Metadata Tablets → User Tables
Chubby: replicated, persisted lock service; maintains tablet server locations; 5 replicas, one elected master (via quorum); Paxos algorithm used to keep consistency
Root Tablet: root of the metadata tree
Up to 3 levels in the metadata hierarchy
127. Google BigTable: Architecture
BigTable client ↔ BigTable master: metadata operations
BigTable client ↔ Tablet Servers: data R/W operations
BigTable master: fs metadata, ACL, GC, load balancing; heartbeat messages, GC, chunk migration with the Tablet Servers
Chubby: tracks master lock, log of live servers
128-134. HBase CP
OSS implementation of BigTable
ZooKeeper as coordinator (instead of Chubby)
Support for multiple masters
Data sorted by key but evenly distributed across the cluster
LICENSE: Apache 2
LANGUAGE: Java
API/PROTOCOL: REST HTTP, Thrift
PERSISTENCE: memtable/SSTable
135-138. Hypertable CP
OSS BigTable implementation
Faster than HBase (10-30K op/s)
Hyperspace (paxos) used instead of ZooKeeper
Dynamically adapts to changes in workload
HQL (~SQL)
LICENSE: GPLv2
LANGUAGE: C++
API/PROTOCOL: C++, Thrift
PERSISTENCE: memtable/SSTable
CONCURRENCY: MVCC
139-142. Cassandra AP
Data model of BigTable, infrastructure of Dynamo
Column: (col_name, col_value, timestamp)
Super Column: super_column_name → [column, column, ...]
Column Family: row_key → [column, column, ...]
Super Column Family: row_key → [super column, super column, ...]
keyspace.get("column_family", key, ["super_column",] "column")
LICENSE: Apache 2
LANGUAGE: Java
PROTOCOL: Thrift, Avro
PERSISTENCE: memtable/SSTable
CONSISTENCY: Tunable R/W/N
http://www.javageneration.com/?p=70 @cassandralondon http://www.meetup.com/Cassandra-London/
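The column / super column nesting and the `keyspace.get(...)` access pattern can be mimicked with nested dicts. The "Users"/"Blog" families, rows and columns below are entirely hypothetical sample data, and `get` is only a stand-in for the Thrift call:

```python
# column family → row_key → (super_column →) column → (value, timestamp)
keyspace = {
    "Users": {  # plain column family
        "row1": {
            "name": ("alice", 1001),
            "email": ("alice@example.com", 1002),
        },
    },
    "Blog": {  # super column family
        "row1": {
            "post-1": {
                "title": ("Hello", 1003),
                "body": ("...", 1003),
            },
        },
    },
}

def get(column_family, key, column, super_column=None):
    """Mimics keyspace.get("column_family", key, ["super_column",] "column")."""
    row = keyspace[column_family][key]
    if super_column is not None:
        row = row[super_column]  # descend one extra level for super columns
    value, _timestamp = row[column]
    return value

assert get("Users", "row1", "name") == "alice"
assert get("Blog", "row1", "title", super_column="post-1") == "Hello"
```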