Meta/Facebook's database serving social workloads runs on top of MyRocks (MySQL on RocksDB), so our performance and reliability depend heavily on RocksDB. Beyond MyRocks, other important systems of ours also run on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we give an overview of RocksDB, cover its key differences from InnoDB, and share a few interesting lessons learned from production.
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,... | confluent
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook | The Hive
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro | Databricks
Zstandard is a fast compression algorithm that you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) Zstandard can reduce Spark local disk I/O by compressing shuffle files significantly. This is very useful in K8s environments: it helps when you use `emptyDir` with the `memory` medium, and it also maximizes the OS cache benefit when you use shared SSDs or container-local storage. In Spark 3.2, SPARK-34390 takes advantage of the Zstandard buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another way to save storage cost on cloud storage such as S3 and to improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work under way to use Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark’s Avro file format in Spark 3.2.0.
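The four use cases above map onto a handful of Spark configuration properties. The following is a minimal sketch that collects them in a plain Python dictionary and renders them as `spark-submit` arguments. The property names are the ones I recall from the Spark configuration documentation and should be checked against the Spark version you run; the helper `to_submit_args` is a hypothetical name introduced only for this illustration.

```python
# Spark properties related to the four Zstandard use cases in the abstract.
# Assumption: names follow the Spark 3.2 configuration docs; verify for your version.
ZSTD_SPARK_CONF = {
    # 1) Shuffle and other internal I/O compression, plus the buffer pool.
    "spark.io.compression.codec": "zstd",
    "spark.io.compression.zstd.bufferPool.enabled": "true",
    # 2) Event log compression.
    "spark.eventLog.enabled": "true",
    "spark.eventLog.compress": "true",
    "spark.eventLog.compression.codec": "zstd",
    # 3) Data file compression for ORC and Parquet output.
    "spark.sql.orc.compression.codec": "zstd",
    "spark.sql.parquet.compression.codec": "zstd",
    # 4) MapStatus serialization.
    "spark.shuffle.mapStatus.compression.codec": "zstd",
}


def to_submit_args(conf):
    """Render a config dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(ZSTD_SPARK_CONF)))
```

Keeping the settings in one dictionary makes it easy to toggle a single use case (for example, only event log compression) and measure its effect in isolation.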
Introduction to memcached, a caching service designed for optimizing performance and scaling in the web stack, seen from the perspective of MySQL/PHP users. Given for second-year students of the professional bachelor in ICT at Kaho St. Lieven, Gent.
The Parquet Format and Performance Optimization Opportunities | Databricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
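To make predicate pushdown with min/max skipping concrete, here is a minimal sketch in pure Python. It does not use a Parquet library: each "row group" is modelled as a dictionary holding its values plus the min/max statistics a Parquet footer would record, and the names `build_row_groups` and `scan_equals` are hypothetical helpers introduced for this illustration.

```python
def build_row_groups(values, group_size):
    """Split values into fixed-size row groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_equals(groups, target):
    """Return matching values and the number of row groups actually read.

    A group is skipped without touching its values when the target lies
    outside the group's [min, max] range.
    """
    matches = []
    groups_read = 0
    for group in groups:
        if target < group["min"] or target > group["max"]:
            continue  # statistics prove the group cannot contain the target
        groups_read += 1
        matches.extend(v for v in group["values"] if v == target)
    return matches, groups_read


if __name__ == "__main__":
    groups = build_row_groups(list(range(100)), group_size=10)
    matches, groups_read = scan_equals(groups, 42)
    print(matches, groups_read, len(groups))  # [42] 1 10
```

With sorted input, only one of the ten row groups has to be decoded; the other nine are ruled out from their statistics alone.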
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level topics, and can be used as an introduction to Apache Spark.
Stephan Ewen - Experiences running Flink at Very Large Scale | Ververica
This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job particularly demanding, show how to configure and tune a large-scale Flink job, and outline what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into:
- analyzing and tuning checkpointing
- selecting and configuring state backends
- understanding common bottlenecks
- understanding and configuring network parameters
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... | Spark Summit
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... | Databricks
Parquet is a very popular column-based format. Spark can automatically skip irrelevant data by pushing filters down to Parquet and using the files' statistics, such as min-max statistics. In addition, Spark users can enable the vectorized Parquet reader to read Parquet files in batches. These features improve Spark performance greatly and save both CPU and I/O. Parquet is the default data format of the data warehouse at ByteDance. In practice, we find that Parquet pushdown filters work poorly and read too much unnecessary data, because the statistics do not discriminate between Parquet row groups (column data is out of order when ETL jobs write it to Parquet files).
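The effect of unordered column data on min-max statistics can be shown with a small, self-contained sketch. It is a toy model rather than the approach from the talk: `row_groups_touched` is a hypothetical helper that counts how many fixed-size row groups have a [min, max] range containing the filter value, which is the number of groups a min-max pushdown filter cannot skip.

```python
def row_groups_touched(values, group_size, target):
    """Count row groups whose [min, max] range contains the target value."""
    touched = 0
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        if min(chunk) <= target <= max(chunk):
            touched += 1
    return touched


if __name__ == "__main__":
    # A deterministic permutation of 0..99 standing in for unordered ETL output
    # (37 is coprime with 100, so every value appears exactly once).
    unordered = [(i * 37) % 100 for i in range(100)]
    ordered = sorted(unordered)

    print(row_groups_touched(unordered, 10, 42))  # 10: no row group can be skipped
    print(row_groups_touched(ordered, 10, 42))    # 1: nine row groups are skipped
```

In the unordered case every row group spans almost the whole value range, so its statistics rule nothing out; sorting (or otherwise clustering) on the filter column before writing is one way to make the same statistics selective again.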
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update... | Databricks
The convergence of big data technology with the traditional database domain has become an industry trend. At present, open source big data processing engines such as Apache Spark, Apache Hadoop, and Apache Flink already support SQL interfaces, and SQL is the dominant way of using them. Companies use this open source software to build their own ETL frameworks and OLAP technology. OLTP, however, is still a strong point of traditional databases. One of the main reasons is their support for ACID.
MyRocks is an open source, LSM-based MySQL database created by Facebook. These slides give an overview of MyRocks and how we deployed it at Facebook, as of 2017.
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
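The failover flow described above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not Kafka's controller code: a partition is a plain dictionary, and `elect_leader` and `handle_broker_failure` are hypothetical helpers. The sketch assumes the new leader is the first replica in assignment order that is both alive and in the in-sync replica set (ISR).

```python
def elect_leader(assigned_replicas, isr, live_brokers):
    """Pick the first assigned replica that is alive and in sync, else None."""
    for broker in assigned_replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None


def handle_broker_failure(partition, failed_broker, live_brokers):
    """Return the partition state after a broker failure (simplified model)."""
    live = set(live_brokers) - {failed_broker}
    # Drop the failed broker from the in-sync set.
    isr = [b for b in partition["isr"] if b != failed_broker]
    leader = partition["leader"]
    if leader == failed_broker:
        # Only partitions led by the failed broker need a new leader.
        leader = elect_leader(partition["replicas"], isr, live)
    return {**partition, "isr": isr, "leader": leader}


if __name__ == "__main__":
    partition = {"replicas": [1, 2, 3], "isr": [1, 2, 3], "leader": 1}
    print(handle_broker_failure(partition, failed_broker=1, live_brokers=[1, 2, 3]))
    # {'replicas': [1, 2, 3], 'isr': [2, 3], 'leader': 2}
```

A leader of `None` in this model corresponds to an offline partition: no in-sync replica is left to serve clients until a broker comes back.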
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud | Noritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Cosco: An Efficient Facebook-Scale Shuffle Service | Databricks
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 | mumrah
Apache Kafka is a new breed of messaging system built for the "big data" world. Coming out of LinkedIn (and donated to Apache), it is a distributed pub/sub system built in Scala. It has been an Apache TLP now for several months with the first Apache release imminent. Built for speed, scalability, and robustness, Kafka should definitely be one of the data tools you consider when designing distributed data-oriented applications.
The talk will cover a general overview of the project and technology, with some use cases, and a demo.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang | Databricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
This presentation briefly describes key features of Apache Cassandra. It was given at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O... | ScyllaDB
Doing performance tuning on a massively distributed database is never an easy task. This is especially true for TiDB, an open-source, cloud-native NewSQL database for elastic scale and real-time analytics, because it consists of multiple components and each component has plenty of metrics.
Like many distributed systems, TiDB uses Prometheus to store the monitoring and performance metrics and Grafana to visualize these metrics. Thanks to these two open source projects, it is easy for TiDB developers to add monitoring and performance metrics. However, as the metrics increase, the learning curve becomes steeper for TiDB users to gain performance insights. In this talk, we will share how we measure latency in a distributed system using a top-down (holistic) approach, and why we introduced "tuning by database time" and "tuning by color" into TiDB. The new methodologies and Grafana dashboard help reduce the time and the requirement of expertise in performance tuning by orders of magnitude.
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying wicked problems, which haven't been resolved in decades. Miraculously, with just a small patch to PostgreSQL core extending this API, it appears possible to solve wicked PostgreSQL problems in a new engine made within an extension.
Migrating from InnoDB and HBase to MyRocks at Facebook | MariaDB plc
Facebook created a new storage engine called MyRocks to optimize space and write performance, and recently migrated both UDB (a database for social activities, and our biggest in production) and Facebook Messenger to MyRocks. In this session, Yoshinori Matsunobu of Facebook talks about the challenges, benefits and lessons learned by migrating these applications from InnoDB to MyRocks.
Slides for a talk.
Talk abstract:
In the dark of the night, if you listen carefully enough, you can hear databases cry. But why? As developers, we rarely consider what happens under the hood of widely used abstractions such as databases. As a consequence, we rarely think about the performance of databases. This is especially true of less widespread, but often very useful, NoSQL databases.
In this talk we will take a close look at NoSQL database performance, peek under the hood of the most frequently used features to see how they affect performance and discuss performance issues and bottlenecks inherent to all databases.
Vote NO for MySQL - Election 2012: NoSQL. Researchers predict a dark future for MySQL. Significant market loss to come. Are things that bad, is MySQL falling behind? A look at NoSQL, an attempt to identify different kinds of NoSQL stores, their goals and how they compare to MySQL 5.6. Focus: Key Value Stores and Document Stores. MySQL versus NoSQL means looking behind the scenes, taking a step back and looking at the building blocks.
Scalable and High available Distributed File System Metadata Service Using gR... | Alluxio, Inc.
Alluxio Community Office Hour
Apr 7, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker: Bin Fan
Alluxio (alluxio.io) is an open-source data orchestration system that provides a single namespace federating multiple external distributed storage systems. It is critical for Alluxio to be able to store and serve the metadata of all files and directories from all mounted external storage both at scale and at speed.
This talk shares our design, implementation, and optimization of the Alluxio metadata service (master node) to address the scalability challenges. In particular, we will focus on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained locking scheme for the file system inode tree, an embedded replicated state machine (based on Raft), and exploration and performance tuning of RPC frameworks (Thrift vs. gRPC). As a result of combining these techniques, Alluxio 2.0 is able to store at least 1 billion files with a significantly reduced memory requirement, serving 3000 workers and 30000 clients concurrently.
In this Office Hour, we will go over:
- Metadata storage challenges
- How to combine different open source technologies as building blocks
- The design, implementation, and optimization of Alluxio metadata service
SQL Server Reporting Services Disaster Recovery webinar | Denny Lee
This is the PASS DW|BI virtual chapter webinar on SQL Server Reporting Services Disaster Recovery with Ayad Shammout and myself - hosted by Julie Koesmarno (@mssqlgirl)
7 Reasons Not to Put an External Cache in Front of Your Database.pptx | ScyllaDB
Teams experiencing subpar latency commonly turn to an external cache to meet the required SLAs. Placing a cache in front of your database might seem like a fast and easy fix, but it often ends up introducing unanticipated complexity, costs, and risks. Caches can be one of the more problematic components of distributed application architecture.
Join this webinar for a technical discussion of the risks associated with using an external cache and a look at an alternative strategy that simplifies your architecture without compromising latency. We’ll cover:
- Different approaches to caching (pre-caching vs. caching, side cache vs. transparent cache)
- 7 specific reasons why external caching can be a bad choice
- Why Linux’s default caching doesn’t work well for databases
- The advantages & architecture of specialized row-based caches
- Real-world examples of why and how teams eliminated their external cache
Deep dive into Clustered Columnstore structures with information on compression algorithms, compression types, locking and dictionaries, as well as the Batch Processing Mode.
SQL Server Reporting Services Disaster Recovery Webinar | Denny Lee
This is the PASS DW/BI webinar on SQL Server Reporting Services (SSRS) Disaster Recovery. You can find the video at: http://www.youtube.com/watch?v=gfT9ETyLRlA
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Strategies for Successful Data Migration Tools.pptxvarshanayak241
Data migration is a complex but essential task for organizations aiming to modernize their IT infrastructure and leverage new technologies. By understanding common challenges and implementing these strategies, businesses can achieve a successful migration with minimal disruption. Data Migration Tool like Ask On Data play a pivotal role in this journey, offering features that streamline the process, ensure data integrity, and maintain security. With the right approach and tools, organizations can turn the challenge of data migration into an opportunity for growth and innovation.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
top nidhi software solution freedownloadvrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Hivelance Technology
Cryptocurrency trading bots are computer programs designed to automate buying, selling, and managing cryptocurrency transactions. These bots utilize advanced algorithms and machine learning techniques to analyze market data, identify trading opportunities, and execute trades on behalf of their users. By automating the decision-making process, crypto trading bots can react to market changes faster than human traders
Hivelance, a leading provider of cryptocurrency trading bot development services, stands out as the premier choice for crypto traders and developers. Hivelance boasts a team of seasoned cryptocurrency experts and software engineers who deeply understand the crypto market and the latest trends in automated trading, Hivelance leverages the latest technologies and tools in the industry, including advanced AI and machine learning algorithms, to create highly efficient and adaptable crypto trading bots
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
Tim Combridge from Sensible Giraffe and Salesforce Ben presents some important tips that all developers should know when dealing with Flows in Salesforce.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
2. Agenda
1. RocksDB Overview
2. Differences between LSM and B+Tree
3. Performance Practices
4. Operations and Reliability Practices
3. ROCKSDB OVERVIEW
What is RocksDB
http://rocksdb.org/
Open Source Log-Structured Merge (LSM) database, forked from LevelDB
• Key-Value LSM persistent store
• Easier integration -- Embedded
• Native compression -- Optimized for fast storage
Used at many backend services at Meta, and many external large services and products
• Column Family, Transaction, Parallelism, etc
• Major use cases inside Meta:
﹘ MyRocks: MySQL on top of RocksDB (RocksDB Storage Engine)
﹘ ZippyDB: Distributed key value store on top of RocksDB
6. ROCKSDB OVERVIEW
Leveled Compaction
For each level, data is sorted by key
(In Level 0, data is sorted by key per file)
Compaction merges one Level n file with about 10 overlapping Level n+1 files, then writes the result into Level n+1
Read Amplification: 1 ~ number of levels (depending on cache -- L0~L2 are usually cached)
Write Amplification: 1 + 1 + fanout * (number of levels – 2) / 2
Space Amplification: 1.11
• 11% is much smaller than B+Tree’s fragmentation
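As a worked example of the amplification figures above, here is a minimal sketch. It assumes a fanout of 10 and reads the 1.11 space amplification as a geometric series in which each level is 1/fanout the size of the level below it. The function names are illustrative only and are not part of RocksDB.

```python
def write_amplification(fanout: int, num_levels: int) -> float:
    # Formula as given on the slide: 1 + 1 + fanout * (number of levels - 2) / 2
    return 1 + 1 + fanout * (num_levels - 2) / 2


def space_amplification(fanout: int, num_levels: int) -> float:
    # Assumption: the bottommost level has relative size 1 and every level
    # above it is 1/fanout the size of the level below (1 + 0.1 + 0.01 + ...).
    return sum(fanout ** -i for i in range(num_levels))


print(write_amplification(fanout=10, num_levels=6))            # 22.0
print(round(space_amplification(fanout=10, num_levels=4), 3))  # 1.111
```

With more levels the series converges to 10/9, which is where the roughly 11% overhead comes from under this reading.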
7. ROCKSDB OVERVIEW
RocksDB Features
Column Family
TransactionDB, BlobDB, TTLDB
Prefix Bloom Filter, Partitioned Filter
DeleteRange, SingleDelete
Merge Operator
Backup Engine
Most configuration parameters can be changed online
8. DIFFERENCES BETWEEN LSM AND B+TREE
LSM vs B+Tree
Smaller space usage
• Smaller fragmentation overhead
• Working well with compression (Saving better than InnoDB Compression)
Lower write amplification
Slower read performance. For memory bound workloads, it is relatively more visible.
Generally, faster write performance
• Maintaining secondary index is cheaper since LSM doesn’t need random reads
• Tables with only primary keys are slower to insert, due to higher unique key constraint check (Get) cost
Major difference vs InnoDB
• RocksDB TransactionDB does not support Gap Lock. Migrating from InnoDB Repeatable Read is tricky.
9. ROCKSDB PERFORMANCE
RocksDB Performance Practices
Use Jemalloc memory allocator
Understand RocksDB data formats, and keep important data sets in memory
Compression
Compaction
10. ROCKSDB PERFORMANCE
RocksDB file format – data, index and filter
<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]
[meta block 2: index block]
[meta block 3: compression dictionary block]
[meta block 4: range deletion block]
...
[meta block K: future extended block]
[metaindex block]
[Footer]
<end_of_file>
Data block -> Storing actual key/values
Filter block -> Storing bloom filter
Index block -> Offsets of each data block
Index block size depends on the number of data blocks
• 16KB -> 4KB data block will increase index block size by 4x
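The 4x figure follows from simple arithmetic. The sketch below assumes one index entry per data block and fixed-size, fully packed blocks, and it ignores compression and key-size effects. The 64 MiB file size is a hypothetical value chosen for illustration.

```python
def index_entries(sst_bytes: int, block_bytes: int) -> int:
    # One index entry per data block; ceiling division covers a partial last block.
    return -(-sst_bytes // block_bytes)


SST_BYTES = 64 * 1024 * 1024  # hypothetical SST file size

entries_16k = index_entries(SST_BYTES, 16 * 1024)
entries_4k = index_entries(SST_BYTES, 4 * 1024)
print(entries_16k, entries_4k, entries_4k // entries_16k)  # 4096 16384 4
```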
11. ROCKSDB PERFORMANCE
Index and Filter size reduction
Filter and Index block cache hit rate is important
Size info can be obtained from Table Property, and cache info is periodically logged in LOG
“optimize_filters_for_hits=true” avoids storing filter in Lmax (saving total filter size by 90%)
Ribbon Filter saves bloom filter size by ~30% with comparable CPU util
Parameters to save index block size
• format_version=4 or 5
• index_block_restart_interval=16
Watch rocksdb_block_cache_index_miss
enable_index_compression=false to save CPU time
MyRocks has information_schema to expose SST file metrics
mysql> select sum(data_block_size)/1024/1024/1024 as size_gb,
sum(index_block_size)/1024/1024/1024 as index_gb,
sum(filter_block_size)/1024/1024/1024 as filter_gb
from information_schema.rocksdb_sst_props;
+-------------------+----------------+----------------+
| size_gb | index_gb | filter_gb |
+-------------------+----------------+----------------+
| 1009.362400736660 | 2.661879514344 | 1.734282894991 |
+-------------------+----------------+----------------+
12. ROCKSDB PERFORMANCE
Direct I/O
RocksDB supports Direct I/O for SST files (data files)
Buffered I/O uses substantial memory (slab) in Linux Kernel
Better memory efficiency and lower %system CPU with Direct I/O, especially if your workload is memory bound
Adjust Block Cache accordingly, since filesystem cache can no longer be useful
Do not mix Buffered I/O and Direct I/O (serialized I/O)
use_direct_io_for_flush_and_compaction=ON
use_direct_reads=ON
cache_high_pri_pool_ratio=0.5
13. ROCKSDB PERFORMANCE
Hybrid Compression
RocksDB allows setting a different compression algorithm for each level
Use stronger compression algorithm (Zstandard) in Lmax to save space
Use faster compression algorithm (LZ4 or None) in higher levels to keep up with writes
compression_per_level=kLZ4Compression or kNoCompression
bottommost_compression=kZSTD
14. ROCKSDB PERFORMANCE
Avoid Compaction if possible
SST File Writer API
• It is users’ responsibility to presort rows by keys
[Figure: normal write path in RocksDB (…. -> Flush -> Compaction -> Compaction) compared with the faster write path]
15. ROCKSDB PERFORMANCE
Bloom Filter
Pay attention to Bloom Filter Size
- “optimize_filters_for_hits=true” avoids storing filter in Lmax (saving total filter size by 90%)
- Ribbon Filter saves bloom filter size by ~30% with comparable CPU util
Whole Key Filtering
Prefix Bloom Filter
16. ROCKSDB PERFORMANCE
Understand what happens with Delete
“Delete” adds a tombstone
MyRocks Update is a combination of Delete and Put
Tombstones don’t disappear until bottom level compaction happens
Some reads need to scan lots of tombstones => inefficient
• In this example, reading 5 entries is needed just for getting one row
RocksDB has an optimized API called SingleDelete, but it can’t eliminate tombstone overheads
• SingleDelete disappears when finding a matching Put. It has a requirement that same-key operations don’t repeat (e.g. Put(1) -> Put(1) -> SD(1) does not work)
• MyRocks internally uses SingleDelete for secondary keys
INSERT INTO t VALUES (1),(2),(3),(4),(5);
  -> Put(1) Put(2) Put(3) Put(4) Put(5)
DELETE FROM t WHERE id <= 4;
  -> Delete(1) Delete(2) Delete(3) Delete(4) Put(5)
SELECT COUNT(*) FROM t;
  -> scans Delete(1) Delete(2) Delete(3) Delete(4) Put(5)
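The cost in this example can be made concrete with a toy model. The sketch below is not RocksDB code: it models only the newest entry per key as a (key, operation) pair and counts how many entries a full scan has to visit to return the live rows.

```python
def count_live(entries):
    """Scan entries in key order; return (live_rows, entries_scanned)."""
    scanned = 0
    live = 0
    for _key, op in entries:
        scanned += 1          # every entry is visited, tombstone or not
        if op == "put":
            live += 1         # only Put entries become visible rows
    return live, scanned


# State after the DELETE above: four tombstones and one remaining Put.
entries = [(1, "delete"), (2, "delete"), (3, "delete"), (4, "delete"), (5, "put")]
print(count_live(entries))  # (1, 5)
```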
17. ROCKSDB PERFORMANCE
Scanning too many tombstones degrades read perf
Range scan (Seek) may hit this issue
Consecutive tombstones can number in the millions if deletes are not handled properly
RocksDB exposes metrics as perf_context INTERNAL_DELETE_SKIPPED_COUNT, with perf context level >= 2
Operations can’t be killed while a Seek is skipping over tombstones
Deletion-Triggered Compaction (DTC) is one of the workarounds
• When creating new SST files, if there are certain number of tombstones, trigger another compaction to wipe tombstones immediately
﹘ MyRocks has a sysvar to control that (rocksdb_compaction_sequential_deletes = 49999 / rocksdb_compaction_sequential_deletes_window = 50000)
﹘ RocksDB has an API to do that
﹘ Trade offs between high read cost and more compaction cost
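The sliding-window idea behind Deletion-Triggered Compaction can be sketched as follows. This is a simplified illustration, not the RocksDB or MyRocks implementation. The function name is hypothetical, and the "at least deletion_trigger deletes within any window of consecutive entries" rule is an assumption about the exact threshold semantics.

```python
from collections import deque


def needs_compaction(ops, window, deletion_trigger):
    """Return True if any `window` consecutive entries hold >= deletion_trigger deletes."""
    recent = deque()   # operations currently inside the sliding window
    deletes = 0        # number of deletes inside the window
    for op in ops:
        recent.append(op)
        if op == "delete":
            deletes += 1
        if len(recent) > window:
            if recent.popleft() == "delete":
                deletes -= 1
        if deletes >= deletion_trigger:
            return True
    return False


print(needs_compaction(["delete"] * 5 + ["put"] * 5, window=5, deletion_trigger=4))  # True
print(needs_compaction(["put", "delete"] * 10, window=5, deletion_trigger=4))        # False
```

A lower trigger removes tombstones sooner at the price of more compaction work, which is the trade-off noted above.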
18. ROCKSDB PERFORMANCE
Slowdown because of too many point lookups
Point Lookup calls Get(). This is more expensive than point lookup from B+Tree
May hit RocksDB LRU block cache contention
• Visible as high %system CPU if that’s the case
• Improvements in RocksDB in progress
Typical workarounds
• Use MultiGet API
﹘ Instead of Get() x N times, issue one MultiGet()
﹘ MyRocks uses MultiGet when setting optimizer_switch = 'mrr=on,mrr_cost_based=off,batched_key_access=on'
• Adding more secondary indexes (different key/values)
﹘ Convert non-covering index scans (1 + N reads) to covering index scans (1 or 1 + small number of reads)
﹘ Cost to update secondary index is cheaper in LSM thanks to skipping reads
20. ROCKSDB RELIABILITY
Preventing Write Stall
Write Stalling is one of the most common problems in RocksDB/LSM
Write stalls because:
• Writing too fast
• L0 flush and compactions are not fast enough
• Creating too many L0 files
• Too many pending compaction bytes
• Inefficient CompactRange API usage
• Wrong Bulk Loading API usage (loading SST file into L0 instead of Lmax, invoking full compactions)
Write stall stats are available from status counters and LOGs
mysql> show global status like 'rocksdb_stall%';
+----------------------------------------------------+-------+
| Variable_name | Value |
+----------------------------------------------------+-------+
| rocksdb_stall_l0_file_count_limit_slowdowns | 0 |
| rocksdb_stall_locked_l0_file_count_limit_slowdowns | 0 |
| rocksdb_stall_l0_file_count_limit_stops | 0 |
| rocksdb_stall_locked_l0_file_count_limit_stops | 0 |
| rocksdb_stall_pending_compaction_limit_stops | 0 |
| rocksdb_stall_pending_compaction_limit_slowdowns | 0 |
| rocksdb_stall_memtable_limit_stops | 0 |
| rocksdb_stall_memtable_limit_slowdowns | 0 |
| rocksdb_stall_total_stops | 0 |
| rocksdb_stall_total_slowdowns | 0 |
| rocksdb_stall_micros | 0 |
+----------------------------------------------------+-------+
11 rows in set (0.00 sec)
2022/02/15-21:03:46.600403 7f5f077ff700 [WARN] [db/column_family.cc:929] [default]
Stopping writes because of estimated pending compaction bytes 1041689026590
21. ROCKSDB RELIABILITY
MemTable/L0 Stalls
If all MemTables get full, and if they can’t be flushed (e.g. max L0 files), further writes are blocked
Reported as these counters
• stall_memtable_limit_stops | slowdowns
• stall_l0_file_count_limit_stops | slowdowns
• stall_total_stops | slowdowns
Common workarounds
• Allow more L0 files -- Increase level0_slowdown_writes_trigger and level0_stop_writes_trigger (typically 20 | 30)
• Make MemTable flush faster -- use faster compression algorithm in L0 (kNoCompression, kLZ4Compression)
• Make L0 compactions faster – use faster compression algorithm in L1, 2
• Start compaction earlier -- decrease level0_file_num_compaction_trigger (typically 4)
• Be careful about implicit Flush in RocksDB (e.g. SetOptions, Checkpoint) since each one creates an L0 file
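The L0 triggers mentioned above can be read as simple thresholds on the L0 file count. The sketch below is a simplified model under that assumption, using the typical values from this slide (4 / 20 / 30) as defaults. The function name is hypothetical, and real RocksDB also considers MemTable and pending-compaction limits.

```python
def l0_write_state(l0_files, compaction_trigger=4, slowdown_trigger=20, stop_trigger=30):
    """Classify write behaviour from the number of L0 files (toy model)."""
    if l0_files >= stop_trigger:
        return "stop"        # writes are blocked
    if l0_files >= slowdown_trigger:
        return "slowdown"    # writes are throttled
    if l0_files >= compaction_trigger:
        return "compacting"  # L0 -> L1 compaction is scheduled, writes proceed
    return "ok"


for n in (2, 4, 20, 30):
    print(n, l0_write_state(n))
```

Raising the slowdown and stop triggers widens the gap between "compacting" and "slowdown", which gives compaction more time to catch up before writes are affected.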
22. ROCKSDB RELIABILITY
Metrics to watch
RocksDB has two important metrics structures
- Stats (e.g. stalls, data/index/filter block cache hit/miss, compaction bytes)
- Perf Context (e.g. tombstone scanned, block decompressed time)
- perf_context_level >= 2 is recommended to get the most useful info, such as tombstones scanned. Level 3 adds time stats and is a little more expensive
MyRocks exposes most metrics via information_schema and show global status
mysql> select * from rocksdb_perf_context_global;
+---------------------------------+-----------------+
| STAT_TYPE | VALUE |
+---------------------------------+-----------------+
| USER_KEY_COMPARISON_COUNT | 270471364854 |
| BLOCK_CACHE_HIT_COUNT | 7014318274 |
| BLOCK_READ_COUNT | 555394733 |
| BLOCK_READ_BYTE | 4359686643590 |
| BLOCK_READ_TIME | 67045272264489 |
| BLOCK_CHECKSUM_TIME | 2065141339797 |
| BLOCK_DECOMPRESS_TIME | 27036226090470 |
| GET_READ_BYTES | 604107492243 |
| MULTIGET_READ_BYTES | 26614080073 |
| ITER_READ_BYTES | 4515817650181 |
| INTERNAL_KEY_SKIPPED_COUNT | 64344684548 |
| INTERNAL_DELETE_SKIPPED_COUNT | 1141058309 |
| INTERNAL_RECENT_SKIPPED_COUNT | 8580663 |
| INTERNAL_MERGE_COUNT | 0 |
| GET_SNAPSHOT_TIME | 478716678460 |
| GET_FROM_MEMTABLE_TIME | 3107700425345 |
| GET_FROM_MEMTABLE_COUNT | 1745423505 |
| GET_POST_PROCESS_TIME | 579743978173 |
| GET_FROM_OUTPUT_FILES_TIME | 102555066991914 |
| SEEK_ON_MEMTABLE_TIME | 226655444780 |
| SEEK_ON_MEMTABLE_COUNT | 104572447 |
| NEXT_ON_MEMTABLE_COUNT | 38671332 |
| PREV_ON_MEMTABLE_COUNT | 2687679 |
| SEEK_CHILD_SEEK_TIME | 23240171176784 |
| SEEK_CHILD_SEEK_COUNT | 668676730 |
…
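Raw counters like these are easier to interpret as ratios. The sketch below derives two rough indicators from the values shown above. It assumes that BLOCK_READ_COUNT approximates block cache misses and that comparing tombstones skipped with internal keys skipped is a meaningful signal; both are interpretations, not documented definitions.

```python
# Counter values copied from the rocksdb_perf_context_global output above.
perf = {
    "BLOCK_CACHE_HIT_COUNT": 7014318274,
    "BLOCK_READ_COUNT": 555394733,
    "INTERNAL_KEY_SKIPPED_COUNT": 64344684548,
    "INTERNAL_DELETE_SKIPPED_COUNT": 1141058309,
}

# Assumption: every block read from storage corresponds to a block cache miss.
hits = perf["BLOCK_CACHE_HIT_COUNT"]
reads = perf["BLOCK_READ_COUNT"]
cache_hit_ratio = hits / (hits + reads)

# Rough indicator of how much skipping work is caused by tombstones.
tombstone_ratio = perf["INTERNAL_DELETE_SKIPPED_COUNT"] / perf["INTERNAL_KEY_SKIPPED_COUNT"]

print(f"approx. block cache hit ratio: {cache_hit_ratio:.3f}")
print(f"tombstones skipped per internal key skipped: {tombstone_ratio:.4f}")
```

Tracking these ratios over time is usually more informative than a single snapshot, since the counters are cumulative.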
23. ROCKSDB RELIABILITY
Most configurations are Dynamic
RocksDB has database level and column family level configurations
Majority of the configurations are column family level
You can change most RocksDB configuration parameters without stopping database
Parameter change examples:
• Decreasing Block cache size to avoid Memory Pressure
• Increasing L0 file limits to avoid L0 stalls
• Changing compression algorithm (effective on next Flush/Compaction)
Column Family parameter change (SetOptions API) involves MemTable Flush. So if you hit L0 stop, you can’t change parameters (fix in roadmap)
24. ROCKSDB RELIABILITY
I/O Error Handling
RocksDB returns an error to the caller on I/O errors, and it’s up to RocksDB users to decide how to handle it
• Normally users get kIOError but it’s not guaranteed (e.g. kIncomplete)
Typical failure handling on errors
• Aborting server
• Returning errors
• Retrying
• In any case, don’t suppress errors
25. ROCKSDB RELIABILITY
I/O Error Handling in MyRocks
Can’t roll back on errors at engine commit. So we abort server instead, and let crash recovery resolve binlog-engine consistency.
26. ROCKSDB RELIABILITY
Unique Key Constraints
RocksDB API Put() does not check if the same key exists or not.
Unlike INSERT in InnoDB, Put() does not return “key already exists” error
Call Get() for checking existence
Call GetForUpdate() to lock the key
MyRocks INSERT wraps with GetForUpdate() and Put(), so it can find unique key violation
You have a choice to blindly insert without reading at all (MyRocks REPLACE has an option to do that)
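The difference between a blind Put() and a checked insert can be illustrated with a toy in-memory store. This is not the RocksDB API: ToyStore, DuplicateKeyError and insert_unique are hypothetical names, and a single process-wide lock stands in for the per-key lock that GetForUpdate() takes.

```python
import threading


class DuplicateKeyError(Exception):
    """Raised when a checked insert finds the key already present."""


class ToyStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        # Blind write: silently overwrites, like Put() in the text above.
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def insert_unique(self, key, value):
        # Check-then-write under a lock, so no other writer can slip in
        # between the existence check and the write.
        with self._lock:
            if key in self._data:
                raise DuplicateKeyError(key)
            self._data[key] = value


store = ToyStore()
store.insert_unique("id=1", "a")
store.put("id=1", "b")               # no error, the old value is replaced
try:
    store.insert_unique("id=1", "c")
except DuplicateKeyError:
    print("duplicate key detected")
print(store.get("id=1"))             # b
```

The checked path pays for an extra read on every insert, which is why skipping the check is offered as an option when the caller knows overwriting is safe.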
27. ROCKSDB RELIABILITY
Data consistency
When you physically copy RocksDB database elsewhere, make sure you copy all dependent files – SST files, WAL, Manifest, blob files
• Several online copy solutions – RocksDB backup engine, myrocks_hotbackup, xtrabackup
By default, RocksDB allows a database to be opened even if WAL files are missing
• This may end up opening database with inconsistency
• This is because Manifest file does not track WAL files
Use more strict option to enforce file integrity
• Turn track_and_verify_wals_in_manifest on
﹘ This tracks WAL file and size
﹘ Opening database with missing WALs is rejected
28. ROCKSDB RELIABILITY
Recovery on Database Crash
RocksDB has a parameter called wal_recovery_mode
• RocksDB default is 2 (kPointInTimeRecovery)
• It used to have default 1 (kAbsoluteConsistency)
• Mode 1 has the side effect of refusing to open the RocksDB database even when it could be recovered
Instance crash (incl process crash) may leave the tail WAL file incomplete
RocksDB refuses to start with param value 1 (kAbsoluteConsistency)
RocksDB does NOT refuse with param value 2 (kPointInTimeRecovery)
General recommendation:
• Use wal_recovery_mode=2 with track_and_verify_wals_in_manifest=ON
• Rely on replication to recover lost transactions
2022-03-26T02:21:26.166366-07:00 0 [Note] [MY-000000] [Server] RocksDB: Opening TransactionDB...
2022-03-26T02:21:28.620095-07:00 0 [ERROR] [MY-000000] [Server] RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated record body
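The behavioural difference between the two recovery modes can be sketched with a toy WAL replay. This is a conceptual model only: the record representation, the function name and the exception are hypothetical, and real RocksDB recovery involves checksums and more cases.

```python
class WalCorruption(Exception):
    """Raised when strict recovery meets an incomplete record."""


def recover(records, mode):
    """Replay WAL records, each a (payload, is_complete) pair.

    mode="absolute_consistency": any incomplete record aborts recovery.
    mode="point_in_time": replay stops at the first incomplete record
    and everything before it is kept.
    """
    recovered = []
    for payload, is_complete in records:
        if not is_complete:
            if mode == "absolute_consistency":
                raise WalCorruption("truncated record body")
            break
        recovered.append(payload)
    return recovered


# A crash left the last record only partially written.
wal = [("put k1", True), ("put k2", True), ("put k3", False)]

print(recover(wal, "point_in_time"))  # ['put k1', 'put k2']
try:
    recover(wal, "absolute_consistency")
except WalCorruption as err:
    print("refused to open:", err)
```

In the point-in-time case the writes after the cut-off are lost locally, which is why the recommendation above relies on replication to recover them.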
29. OTHER TOPICS
Dealing with Snapshot Conflicts
InnoDB natively supports range lock (next key lock / gap lock) by default
• This was for historical reason to work with Statement Based Binary Logging in MySQL
• Often caused hot row lock contentions
• Range lock is not held with Row Based Binary Logging + Read Committed Isolation Level
RocksDB (and many other databases including PostgreSQL) do not support range lock
• There is ongoing work with an external contributor to support it in RocksDB
PostgreSQL Repeatable Read (and Serializable) returns “Snapshot Conflict” error on conflicts
MyRocks uses RocksDB TransactionDB and implements the same behavior
You can’t eliminate “snapshot conflict” errors with Repeatable Read / Serializable isolation without range lock
Handling errors, or switching to Read Committed are typical workarounds
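Handling the error on the application side usually means retrying the whole transaction. The sketch below shows one possible shape of such a retry loop. SnapshotConflict and run_with_retry are hypothetical names; a real client would map its driver's deadlock error code to this exception.

```python
class SnapshotConflict(Exception):
    """Stand-in for the driver error raised on a snapshot conflict."""


def run_with_retry(txn_fn, max_attempts=3):
    """Run txn_fn, retrying the whole transaction on snapshot conflicts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except SnapshotConflict:
            if attempt == max_attempts:
                raise  # never suppress the error after the last attempt


# Example: a transaction that conflicts twice and then succeeds.
calls = {"n": 0}


def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise SnapshotConflict()
    return "committed"


print(run_with_retry(flaky_txn))  # committed
```

The retried function must restart the transaction from the beginning, because any rows it read before the conflict may no longer be current.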
30. OTHER TOPICS
InnoDB to MyRocks/RocksDB migration steps
InnoDB RR (Repeatable Read) -> InnoDB RC (Read Committed) -> MyRocks RC
• Evaluate if there are queries depending on gap lock
﹘ Meta-MySQL feature: gap_lock_write_log and gap_lock_raise_error are sysvars to help
• InnoDB RC to MyRocks RC is straight forward
InnoDB RR -> MyRocks RR (-> MyRocks RC)
• Evaluate if there are noticeable number of snapshot conflict errors
﹘ rocksdb_snapshot_conflict_errors is a status counter that tells how often snapshot conflicts are hit
﹘ Users see ‘Snapshot Conflict’ error message with ‘DEADLOCK’ error code
• Flipping from RR to RC eliminates snapshot conflict errors
﹘ But it is necessary to verify if RC is safe
31. Summary
RocksDB is a modern LSM database library, with years of production deployments at scale
Compared to B+Tree, RocksDB (LSM) saves space and offers faster write performance, but pay attention to read performance drops
Pay attention to data, index and filter block size and cache miss
Utilize compression and compaction tuning options
Pay attention to tombstone scanning costs, and utilize several mitigations like Deletion Triggered Compaction
Pay attention to write stalls