RocksDB is an embedded key-value store optimized for fast storage. It uses a log-structured merge-tree (LSM-tree) to organize data on storage. Optimizing RocksDB for open-channel SSDs would allow it to control data placement, exploiting flash parallelism and minimizing overhead. This could be done by mapping RocksDB files, such as SSTables and logs, to virtual blocks that in turn map to physical flash blocks in a way that accounts for data access patterns and flash characteristics. This would improve performance by reducing writes and garbage collection.
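The LSM-tree write path described above can be illustrated with a minimal, hypothetical sketch (the `MiniLSM` class and its methods are ours, not RocksDB's): writes land in an in-memory memtable, which is flushed to an immutable, sorted run (an "SSTable") once it fills up.

```python
# Minimal, hypothetical LSM-tree sketch: an in-memory memtable flushed to
# immutable sorted runs ("SSTables") when it fills up. This illustrates
# only the write/read path; real RocksDB adds a write-ahead log,
# compaction, bloom filters, block caches, and much more.

class MiniLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # mutable, in-memory writes
        self.sstables = []            # newest-first list of sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is an immutable, key-sorted run on storage.
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            for k, v in run:          # a real store binary-searches here
                if k == key:
                    return v
        return None

db = MiniLSM()
for i in range(10):
    db.put(f"k{i}", i)
```

Because the flushed runs are immutable and written sequentially, an open-channel placement layer could map each run onto whole flash blocks, so that deleting an obsolete run erases entire blocks and avoids device-side garbage collection.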
Meta/Facebook's databases serving social workloads run on top of MyRocks (MySQL on RocksDB), which means our performance and reliability depend heavily on RocksDB. Beyond MyRocks, we have other important systems running on top of RocksDB as well. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, cover key differences from InnoDB, and share a few interesting lessons learned from production.
Kafka is becoming an ever more popular choice for enabling fast data and streaming. Kafka provides a wide landscape of configuration options that let you tweak its performance profile, and understanding its internals is critical for picking the ideal configuration: depending on your use case and data needs, different settings perform very differently. Let's walk through the performance essentials of Kafka: how producer configuration can speed up or slow down the flow of messages to brokers; message keys, their implications, and their impact on partition performance; how to figure out how many partitions and how many brokers you should have; and consumers and what affects their performance. How do you combine all of these choices into the best strategy moving forward? How do you test Kafka's performance? I will attempt a live demo with the help of Zeppelin to show in real time how to tune for performance.
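The impact of message keys on partitions mentioned above can be sketched in a few lines. This is a simplified illustration, not Kafka's actual code: Kafka's default partitioner hashes the key with murmur2, while we use `zlib.crc32` here purely for demonstration. The important property is the same: records with equal keys always land in the same partition.

```python
# Simplified sketch of Kafka's keyed partitioning: same key -> same
# partition, which preserves per-key ordering but can skew load if key
# popularity is uneven. Kafka's default partitioner uses a murmur2 hash;
# zlib.crc32 stands in for it here purely for illustration.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# A given key always maps to one partition, so its records stay ordered.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
```

Note that the mapping depends on the partition count: adding partitions to a keyed topic remaps keys, which is one reason partition-count decisions deserve planning up front.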
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur, ...) - Confluent
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single-node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command-line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is an understanding of the internal details of the default state store in Kafka Streams, so that engineers can fine-tune its performance for a variety of workloads and operate the state stores in a more robust manner.
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink... - Hosted by Confluent
Apache Kafka is one of the most commonly used connectors with Apache Flink for exactly-once streaming use cases. The combination of the two systems allows you to build mission-critical applications that require low end-to-end latency and exactly-once processing, e.g., banks processing transactions. In Apache Flink 1.14, we released a new KafkaSink based on Apache Flink’s unified Sink interface that natively supports both streaming and batch execution.
However, we needed to stretch Kafka’s transactions API to fully support exactly-once processing in Flink. In this talk, we will start with a quick recap of Apache Kafka’s transactions and Flink’s checkpointing mechanism. Then, we describe the two-phase commit protocol implemented in KafkaSink in depth and emphasize the difficulties we overcame when applying Kafka’s transaction API to longer-lasting transactions.
We explain how we ensure performant writing to Apache Kafka and how the KafkaSink recovery works.
In summary, this talk should give users a deep dive into how Apache Flink leverages Apache Kafka’s transactions and developers an overview of what they have to consider when using Apache Kafka’s transactions.
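The two-phase commit idea at the heart of this design can be sketched abstractly. This is a generic, hypothetical model (class and method names are ours, not Flink's or Kafka's): phase one ("pre-commit") happens on checkpoint, and phase two ("commit") runs only after every participant has pre-committed; any failure aborts all participants.

```python
# Generic two-phase commit sketch. In an exactly-once sink, phase 1
# would flush pending records and hold the transaction open at a
# checkpoint; phase 2 commits only once all participants are prepared.

class TxnParticipant:
    def __init__(self, name):
        self.name = name
        self.state = "idle"

    def pre_commit(self) -> bool:
        # e.g. flush buffered records, keep the transaction open
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    # Phase 1: every participant must prepare; one failure aborts all.
    if all(p.pre_commit() for p in participants):
        for p in participants:        # Phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The "longer-lasting transactions" difficulty mentioned in the abstract comes from the gap between phase 1 and phase 2: the sink must keep transactions open across checkpoint completion and be able to re-commit them after recovery.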
Netflix’s architecture involves thousands of microservices built to serve unique business needs. As this architecture grew, it became clear that data storage and query needs were unique to each area; there is no one silver bullet that fits the data needs of all microservices. CDE (the Cloud Database Engineering team) offers polyglot persistence, which promises ideal matches between problem spaces and persistence solutions. In this meetup you will get a deep dive into the self-service platform, our solution for repairing Cassandra data reliably across datacenters, Memcached Flash and cross-region replication, and graph database evolution at Netflix.
In this talk, we'll walk through RocksDB technology and look into areas where MyRocks is a good fit compared to other engines such as InnoDB. We will go over the internals, benchmarks, and tuning of the MyRocks engine, and explore the benefits of using MyRocks within the MySQL ecosystem. Attendees will come away with the latest developments in tooling and integration within MySQL.
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook - The Hive
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop.
It is also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through best practices for deploying Apache Kafka
in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We will also cover best practices for running producers and consumers.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now supports authentication of users and access control over who can read from and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
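A client configuration for the security features above might look like the following sketch. The property names are standard Kafka client settings for SASL over TLS; the broker address, file paths, and password are placeholders you would replace for your environment.

```python
# Illustrative Kafka client settings for SSL wire encryption plus
# SASL/Kerberos authentication, as described above. All values below
# (hosts, paths, password) are placeholders, not a real deployment.
secure_client_config = {
    "bootstrap.servers": "broker1:9093",
    "security.protocol": "SASL_SSL",           # SASL auth over TLS
    "sasl.mechanism": "GSSAPI",                # Kerberos
    "sasl.kerberos.service.name": "kafka",
    "ssl.truststore.location": "/etc/kafka/client.truststore.jks",
    "ssl.truststore.password": "changeit",
}
```

With authentication in place, per-topic read/write permissions are then granted via ACLs (or centrally via Apache Ranger, as the talk describes).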
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level topics, and can serve as an introduction to Apache Spark.
Tuning Apache Spark for Large-Scale Workloads (Gaoxiang Liu and Sital Kedia) - Databricks
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts toward automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
Real-time Analytics with Trino and Apache Pinot - Xiang Fu
Trino summit 2021:
An overview of the Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support with the power of Apache Pinot's real-time analytics, giving you the best of both worlds.
Communication between microservices is inherently unreliable. These integration points may produce cascading failures, slow responses, and service outages. We will walk through stability patterns like timeouts, circuit breakers, and bulkheads, and discuss how they improve the stability of microservices.
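The circuit-breaker pattern mentioned above can be sketched in a few lines. This is a minimal, generic illustration (the `CircuitBreaker` class is ours, not from any particular library): after enough consecutive failures the breaker "opens" and fails fast instead of calling the flaky dependency, then lets a trial call through after a cool-down.

```python
# Minimal circuit-breaker sketch: after `max_failures` consecutive
# failures the breaker opens and fails fast; after `reset_after`
# seconds it goes "half-open" and allows one trial call through.
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # success closes the breaker
        return result
```

Failing fast matters because it frees up the caller's threads and connections instead of letting them pile up on a dependency that is already down, which is exactly how cascading failures start.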
Linux Memory Management with CMA (Contiguous Memory Allocator) - Pankaj Suryawanshi
Fundamentals of Linux memory management and CMA (Contiguous Memory Allocator) in Linux.
Virtual memory, physical memory, swap space, DMA, IOMMU, paging, segmentation, TLB, hugepages, and ION (Google's Android memory allocator).
VictoriaLogs: Open Source Log Management System - Preview - VictoriaMetrics
VictoriaLogs Preview - Aliaksandr Valialkin
* Existing open source log management systems
- ELK (ElasticSearch) stack: Pros & Cons
- Grafana Loki: Pros & Cons
* What is VictoriaLogs
- Open source log management system from VictoriaMetrics
- Easy to set up and operate
- Scales vertically and horizontally
- Optimized for low resource usage (CPU, RAM, disk space)
- Accepts data from Logstash and Fluentbit in Elasticsearch format
- Accepts data from Promtail in Loki format
- Supports stream concept from Loki
- Provides an easy-to-use yet powerful query language - LogsQL
* LogsQL Examples
- Search by time
- Full-text search
- Combining search queries
- Searching arbitrary labels
* Log Streams
- What is a log stream?
- LogsQL examples: querying log streams
- Stream labels vs log labels
* LogsQL: stats over access logs
* VictoriaLogs: CLI Integration
* VictoriaLogs Recap
Introduction to Apache Flink - Fast and reliable big data processing - Till Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine currently undergoing incubation at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming, and automatic optimization make it a unique system in the world of big data processing.
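The PageRank algorithm the talk implements on Flink boils down to a simple per-vertex update. The serial sketch below shows just the math, not Flink's API: each iteration, every page's rank is recomputed as a damped share of the ranks of the pages linking to it.

```python
# Plain power-iteration PageRank: rank[p] = (1-d)/N + d * sum over
# in-links of rank[src]/out_degree[src]. Flink expresses the same
# update as a parallel, iterative dataflow; this version is serial
# and assumes every page has at least one outgoing link.

def pagerank(links, d=0.85, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}
        for src, outs in links.items():
            if outs:
                share = d * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share   # distribute src's rank evenly
        rank = new
    return rank

graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank(graph)
```

In a distributed engine the inner loop becomes a join between the rank dataset and the adjacency dataset followed by a grouped sum, which is why PageRank is such a natural showcase for iterative dataflow systems.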
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote... - Databricks
The increasing challenge of serving ever-growing data driven by AI and analytics workloads makes disaggregated storage and compute more attractive, as it enables companies to scale their storage and compute capacity independently to match data and compute growth rates. Cloud-based big data services are gaining momentum as they provide simplified management, elasticity, and a pay-as-you-go model.
This presentation briefly describes key features of Apache Cassandra. It was held at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
Apache BookKeeper: A High Performance and Low Latency Storage Service - Sijie Guo
Apache BookKeeper is a high-performance and low-latency storage service optimized for storing immutable and append-only data (such as logs, streaming events, and objects). Sijie Guo and JV share their experience with Apache BookKeeper. This talk covers the motivation for and an overview of BookKeeper, dives into implementation details, and describes the use cases built upon it.
Speaker: Jay Runkel, Principal Solution Architect, MongoDB
Session Type: 40 minute main track session
Track: Operations
When architecting a MongoDB application, one of the most difficult questions to answer is how much hardware (number of shards, number of replicas, and server specifications) the application is going to need. Similarly, when deploying in the cloud, how do you estimate your monthly AWS, Azure, or GCP costs given a description of a new application? While there isn’t a precise formula for mapping application features (e.g., document structure, schema, query volumes) onto servers, there are various strategies you can use to estimate MongoDB cluster sizing. This presentation will cover the questions you need to ask and describe how to use this information to estimate the required cluster size or cloud deployment cost.
What You Will Learn:
- How to architect a sharded cluster that provides the required computing resources while minimizing hardware or cloud computing costs
- How to use this information to estimate the overall cluster requirements for IOPS, RAM, cores, disk space, etc.
- What you need to know about the application to estimate a cluster size
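A back-of-the-envelope version of this sizing exercise can be written down explicitly. Every number and formula below is a hypothetical illustration of the kind of rule of thumb the talk describes (hot data plus indexes should fit in usable RAM), not a MongoDB-endorsed model.

```python
# Hypothetical cluster-sizing sketch: estimate data, index, and working
# set sizes, then derive a shard count from per-server usable RAM.
# All inputs and the 60% usable-RAM rule of thumb are illustrative.

def estimate_cluster(total_docs, avg_doc_bytes, index_bytes_per_doc,
                     working_set_fraction, ram_per_server_gb,
                     usable_ram_fraction=0.6):
    data_gb = total_docs * avg_doc_bytes / 1e9
    index_gb = total_docs * index_bytes_per_doc / 1e9
    # Rule of thumb: hot data plus all indexes should fit in RAM.
    working_set_gb = data_gb * working_set_fraction + index_gb
    usable_gb = ram_per_server_gb * usable_ram_fraction
    shards = max(1, -(-working_set_gb // usable_gb))  # ceiling division
    return {"data_gb": round(data_gb, 1),
            "index_gb": round(index_gb, 1),
            "working_set_gb": round(working_set_gb, 1),
            "shards": int(shards)}

# Hypothetical workload: 2B documents of 1 KB each, 100 B of index per
# document, 10% of the data hot, servers with 128 GB of RAM.
plan = estimate_cluster(2_000_000_000, 1000, 100, 0.10, 128)
```

A real estimate would layer in IOPS, query volumes, replication factor, and growth projections, but the structure of the arithmetic stays the same.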
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg have emerged. Alongside the Hive Metastore, these table formats are trying to solve long-standing problems in traditional data lakes with features such as ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
ARM Server: The Cy7 Introduction by Aaron Joue, Ambedded Technology
The world's first ARM server for cloud storage. It is compatible with Hadoop, GlusterFS, and Ceph. Each node consumes less than 2.5 watts, and density is very high: 1,824 TB in a rack.
Fusion-io Memory Flash for Microsoft SQL Server 2012 - Mark Ginnebaugh
You've heard about Solid State Drives (SSDs), and might be using them now. To get dramatically improved IO performance, you need Flash Memory – storage that can be connected to your server’s Bus, and really maximize IO.
Fusion-io is an industry leader in this area, and Sumeet Bansal explains how to best employ this powerful technology. You'll learn:
* The many ways Flash can help your SQL Server performance, while at the same time lowering costs
* How you can use Flash optimally for your SQL Server deployment
* Easy, low risk ways to introduce ioMemory into SQL Server environments to instantly realize significant benefits.
* How to implement ioMemory optimally for the most pervasive configurations of SQL Server
Energy Saving ARM Server Cluster Born for Distributed Storage & Computing - Aaron Joue
An innovative ARM-based server cluster in a 1U chassis. Multi-node ARM servers scale out computing and storage simultaneously and are designed for distributed storage and computing. Integrated with Ceph, it provides an energy-saving software-defined storage appliance.
Fog computing is a paradigm that extends cloud computing and services to the edge of the network. Like the cloud, fog provides data, compute, storage, and application services to end users. The motivation for fog computing lies in a series of real scenarios, such as the smart grid, smart traffic lights in vehicular networks, and software-defined networks.
In this presentation, we introduce liblightnvm, a user space library that manages provisioning and I/O submission for physical flash.
We argue how liblightnvm can benefit I/O-intensive applications by providing predictable latency and reducing device write amplification, thus prolonging the device's endurance. We show how to integrate liblightnvm with RocksDB.
The rapid growth of in-memory compute applications is not surprising given the tremendous performance gains they can offer. Jobs that used to take hours can now take minutes or seconds, as they are no longer subject to the rotational and seek latencies of spinning media. While flash memory provides some relief, it is still a hundred times slower than the DRAM that in-memory compute applications utilize as their primary storage.
One drawback of in-memory compute applications is the high cost of DRAM. Not only are acquisition costs an order of magnitude higher than flash, DRAM also consumes far more power. Power can be a significant issue in data centers and contributes a major part of operational costs. In addition, a single server has limited DRAM capacity, so larger datasets need to find an alternative solution or cope with the nuisance of sharding. Furthermore, to utilize the maximum DRAM capacity of a server, higher-cost DRAM modules must be installed, further escalating the cost of compute.
We discuss a paradigm to allow in-memory computing applications to extend their capacity by utilizing Flash memory; often with minimal performance loss. We give examples of applications that have been modified to utilize the paradigm and show performance comparisons. We also discuss TCO and the relative cost per transaction of the different solutions.
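The shape of the TCO comparison discussed above can be made concrete with a toy calculation. Every number below is an illustrative assumption (component prices and power figures vary widely over time and by vendor); the point is the structure of the arithmetic, hardware cost plus energy cost per usable gigabyte, not the specific values.

```python
# Hypothetical DRAM-vs-flash TCO sketch: cost per GB over a service
# life, combining acquisition cost and energy cost. All inputs are
# illustrative assumptions, not market data.

def cost_per_gb(capacity_gb, price_per_gb, watts_per_gb,
                years=3, dollars_per_kwh=0.10):
    hardware = capacity_gb * price_per_gb
    hours = 24 * 365 * years
    energy_kwh = capacity_gb * watts_per_gb * hours / 1000
    return (hardware + energy_kwh * dollars_per_kwh) / capacity_gb

# Assumed figures: DRAM ~$8/GB and ~0.4 W/GB; flash ~$0.5/GB and
# ~0.01 W/GB, over a 3-year life at $0.10/kWh.
dram_tco = cost_per_gb(1024, 8.0, 0.4)
flash_tco = cost_per_gb(1024, 0.5, 0.01)
```

Under assumptions like these, flash-extended memory comes out an order of magnitude cheaper per gigabyte, which is why a modest performance loss can still be a net win on cost per transaction.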
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's... - Glenn K. Lockwood
Comparing the burst buffers of today, such as the Cray DataWarp-based burst buffer implemented on NERSC Cori, to the proto-burst buffer deployed on SDSC's Gordon supercomputer in 2012.
The presentation provides you with the necessary steps to follow when migrating to XtraDB Cluster.
Percona provides an in-depth review of your database and recommends appropriate changes by performing a complete MySQL health check in which we identify inefficiencies, find problems before they occur, and ensure that your MySQL database is in the best condition.
Learn how upcoming changes in the persistent memory market will affect deployments of in-memory computing and traditional applications. Using software innovations from SanDisk and the broad portfolio of flash storage hardware options, customers and developers can optimize applications for “flash extended memory”, the intersection of in-memory computing and persistent memory technologies.
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb... - Amazon Web Services
Get a look under the hood: understand how to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve query delivery and overall database performance. You’ll also hear how the University of Technology Sydney (UTS) is using Redshift. UTS will describe how Amazon Redshift enabled agility in dealing with data quality, the capacity to scale when required, and optimized development processes through rapid provisioning of data warehouse environments.
Speaker: Ganesh Raja, Solutions Architect, Amazon Web Services with Susan Gibson, Manager, Data and Business Intelligence, UTS
Level: 300
With Hadoop 3.0.0-alpha2 released in January 2017, it's time to take a closer look at the features and fixes of Hadoop 3.0.
We will look at core Hadoop, HDFS, and YARN, and address the emerging question: will Hadoop 3.0 be an architectural revolution like Hadoop 2 was with YARN & Co., or more of an evolution adapting to new use cases like IoT, machine learning, and deep learning (TensorFlow)?
Accelerating HBase with NVMe and Bucket CacheNicolas Poggi
on-Volatile-Memory express (NVMe) standard promises and order of magnitude faster storage than regular SSDs, while at the same time being more economical than regular RAM on TB/$. This talk evaluates the use cases and benefits of NVMe drives for its use in Big Data clusters with HBase and Hadoop HDFS.
First, we benchmark the different drives using system level tools (FIO) to get maximum expected values for each different device type and set expectations. Second, we explore the different options and use cases of HBase storage and benchmark the different setups. And finally, we evaluate the speedups obtained by the NVMe technology for the different Big Data use cases from the YCSB benchmark.
In summary, while the NVMe drives show up to 8x speedup in best case scenarios, testing the cost-efficiency of new device technologies is not straightforward in Big Data, where we need to overcome system level caching to measure the maximum benefits.
Database as a Service on the Oracle Database Appliance PlatformMaris Elsins
Speaker: Marc Fielding, Co-speaker: Maris Elsins.
Oracle Database Appliance provides a robust, highly-available, cost-effective, and surprisingly scalable platform for database as a service environment. By leveraging Oracle Enterprise Manager's self-service features, databases can be provisioned on a self-service basis to a cluster of Oracle Database Appliance machines. Discover how multiple ODA devices can be managed together to provide both high availability and incremental, cost-effective scalability. Hear real-world lessons learned from successful database consolidation implementations.
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive
Dhruba Borthakur, Facebook
Dhruba Borthakur is an engineer at Facebook. He has been one of the founding engineer of RocksDB, an open-source key-value store optimized for storing data in flash and main-memory storage. He has been one of the founding architects of the Apache Hadoop Distributed File System and has been instrumental in scaling Facebook's Hadoop cluster to multiples of petabytes. Dhruba has contributed code to the Apache HBase project. Earlier, he contributed to the development of the Andrew File System (AFS). He has an M.S. in Computer Science from the University of Wisconsin, Madison and a B.S. in Computer Science BITS, Pilani, India.
With employees based in countries around the globe which provide 24x7 services to MySQL users worldwide, Percona provides enterprise-grade MySQL Support, Consulting, Training, Managed Services, and Server Development services to companies ranging from large organizations, such as Cisco Systems, Alcatel-Lucent, Groupon, and the BBC, to recent startups building MySQL-powered solutions for businesses and consumers.
Similar to Optimizing RocksDB for Open-Channel SSDs (20)
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
1. Towards Application Driven Storage: Optimizing RocksDB for Open-Channel SSDs
Javier González <javier@cnexlabs.com>
LinuxCon Europe 2015
Contributors: Matias Bjørling and Florin Petriuc
2. Application Driven Storage: What is it?
[Stack diagram: RocksDB, metadata management, and standard libraries with app-specific optimizations in user space; the page cache, block I/O interface, and FS-specific logic in kernel space.]
3. Application Driven Storage: What is it?
• Application Driven Storage
- Avoid multiple (redundant) translation layers
- Leverage optimization opportunities
- Minimize overhead when manipulating persistent data
- Make better decisions regarding latency, resource utilization, and data movement (compared to today's best-effort techniques)
[Same stack diagram as slide 2: generic <> optimized.]
➡ Motivation: give the tools to the applications that know how to manage their own storage
4. Application Driven Storage Today
• Arrakis (https://arrakis.cs.washington.edu)
- Remove the OS kernel entirely from normal application execution
• Samsung multi-stream
- Let the SSD know from where "I/O streams" emerge to make better decisions
• Fusion-io
- Dedicated I/O stack to support a specific type of hardware
• Open-Channel SSDs
- Expose SSD characteristics to the host and give it full control over its storage
5. Traditional Solid State Drives
• Flash complexity is abstracted away from the host by an embedded Flash Translation Layer (FTL)
- Maps logical addresses (LBAs) to physical addresses (PPAs)
- Deals with flash constraints (next slide)
- Has enabled adoption by making SSDs compliant with the existing I/O stack
High throughput + low latency; parallelism + controller
6. Flash memory 101
[Diagram: a flash block as an array of pages (Page 0 ... Page n-1), each carrying a state and out-of-band (OOB) data.]
• Flash constraints:
- Write at a page granularity
  • Page states: valid, invalid, erased
- Write sequentially within a block
- Write always to an erased page
  • The page becomes valid
- Updates are written to a new page
  • Old pages become invalid -> need for GC
- Read at a page granularity (seq./random reads)
- Erase at a block granularity (all pages in the block)
• Garbage collection (GC):
  • Move valid pages to a new block
  • Erase the old block -> all its pages return to the erased state
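The page/block rules above can be captured in a few lines of code. This is a minimal sketch of our own (not code from the talk), assuming a fixed 64-page block, showing why updates invalidate rather than overwrite and why erase only works on whole blocks:

```c
/* Minimal sketch of the flash constraints above: pages are written to
 * erased pages only, sequentially within a block; updates invalidate the
 * old page; erase resets the whole block. Not the talk's code. */
#include <assert.h>

#define PAGES_PER_BLOCK 64

enum page_state { ERASED, VALID, INVALID };

struct flash_block {
    enum page_state pages[PAGES_PER_BLOCK];
    int next_page;   /* writes must be sequential within the block */
};

/* Write to the next erased page; returns the page index, or -1 if the
 * block is full and the caller must provision a new one. */
int block_write(struct flash_block *b)
{
    if (b->next_page == PAGES_PER_BLOCK)
        return -1;
    assert(b->pages[b->next_page] == ERASED);  /* never overwrite in place */
    b->pages[b->next_page] = VALID;
    return b->next_page++;
}

/* Updates never overwrite: the old page is only marked invalid, and GC
 * reclaims it later. */
void page_invalidate(struct flash_block *b, int page)
{
    assert(b->pages[page] == VALID);
    b->pages[page] = INVALID;
}

/* Erase works at block granularity: every page returns to the erased state. */
void block_erase(struct flash_block *b)
{
    for (int i = 0; i < PAGES_PER_BLOCK; i++)
        b->pages[i] = ERASED;
    b->next_page = 0;
}
```

A device-side (or, with Open-Channel SSDs, host-side) GC would copy the remaining VALID pages of a block elsewhere before calling block_erase().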
7. Open-Channel SSDs: Overview
• Open-Channel SSDs share control responsibilities with the host in order to implement and maintain features that typical SSDs implement strictly in the SSD device firmware
• Host-based FTL manages:
- Data placement
- I/O scheduling
- Over-provisioning
- Garbage collection
- Wear-leveling
• Host needs to know:
- SSD features & responsibilities
- SSD geometry
  • NAND media idiosyncrasies
  • Die geometry (blocks & pages)
  • Channels, timings, etc.
  • Bad blocks & ECC
Host manages physical flash; physical flash is exposed to the host (read, write, erase) -> Application Driven Storage
8. Open-Channel SSDs: LightNVM
[Architecture diagram: the LightNVM framework. Targets (Block Target, Direct Flash Target, Vendor-Specific Target) expose raw or managed geometry to kernel and user space (key-value/object/FS/block, file system). Block managers (generic, vendor-specific, ...) sit below the targets; Open-Channel SSDs (NVMe, PCI-e, RapidIO, ...) provide offload engines (block copy, XOR, ECC, GC), metadata and bad-block state management, and error handling.]
• Targets
- Expose physical media to user space
• Block managers
- Manage physical SSD characteristics
- Even out wear-leveling across all flash
• Open-Channel SSD
- Responsibility
- Offload engines
10. Open-Channel SSDs: Challenges
1. Which classes of applications would benefit most from being able to manage physical flash?
- Modify the storage backend (i.e., no POSIX)
- Probably no file system, page cache, block I/O interface, etc.
2. Which changes do we need to make to these applications?
- Make them work on Open-Channel SSDs
- Optimize them to take advantage of directly using physical flash (e.g., data structures, file abstractions, algorithms)
3. Which interfaces would (i) make the transition simpler, and (ii) simultaneously cover different classes of applications?
➡ A new paradigm that we need to explore in the whole I/O stack
11. RocksDB: Overview
• Embedded key-value persistent store
• Based on a Log-Structured Merge-Tree
• Optimized for fast storage
• Server workloads
• Fork of LevelDB
• Open source: https://github.com/facebook/rocksdb
• RocksDB is not:
- Not distributed
- No failover
- Not highly available
References: "The Story of RocksDB", Dhruba Borthakur and Haobo Xu (link); "The Log-Structured Merge-Tree", Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil, Acta Informatica, 1996.
13. Problem: RocksDB Storage Backend
[Diagram: the RocksDB LSM sits on a pluggable storage backend (POSIX, HDFS, Win). On disk: user data (sstables), the DB log (WALs), metadata (MANIFESTs + CURRENT), and other files (LOG (Info), LOCK, IDENTITY).]
• Storage backend decoupled from the LSM
- WritableFile(): sequential writes -> the only way to write to secondary storage
- SequentialFile(): sequential reads; used primarily for sstable user data and recovery
- RandomAccessFile(): random reads; used primarily for metadata (e.g., CRC checks)
- Sstable: persistent memtable
- DB Log: write-ahead log (WAL)
- MANIFEST: file metadata
- IDENTITY: instance ID
- LOCK: use_existing_db
- CURRENT: superblock
- Info Log: logging & debugging
14. RocksDB: LSM using Physical Flash
• Objective: fully optimize RocksDB for flash memories
- Control data placement:
  • User data in sstables is close in the physical media (same block, adjacent blocks)
  • Same for the WAL and MANIFEST
- Exploit parallelism:
  • Define virtual blocks based on file write patterns in the storage backend
  • Get blocks from different LUNs based on RocksDB's LSM write patterns
- Schedule GC and minimize over-provisioning:
  • Use the LSM's sstable merging strategies to minimize (and ideally remove) the need for GC and over-provisioning on the SSD
- Control I/O scheduling:
  • Prioritize I/Os based on the LSM's persistence needs (e.g., L0 and the WAL have higher priority than levels used for compacted data, to maximize persistency in case of power loss)
➡ Implement an FTL optimized for RocksDB, which can be reused for similar applications (e.g., LevelDB, Cassandra, MongoDB)
15. RocksDB + DFlash: Challenges
• Sstables (persistent memtables)
- P1: Fit block sizes in L0 and further levels (merges + compactions)
  • No need for GC on the SSD side; RocksDB merging acts as GC (less write and space amplification)
- P2: Keep block metadata to reconstruct an sstable in case of a host crash
• WAL (write-ahead log) and MANIFEST
- P3: Fit block sizes (same as for sstables)
- P4: Keep block metadata to reconstruct the log in case of a host crash
• Other metadata
- P5: Keep superblock metadata and allow recovering the database
- P6: Keep other metadata to account for flash constraints (e.g., partial pages, bad pages, bad blocks)
• Process
- P7: Follow the RocksDB architecture -> an upstreamable solution
16. P1, P3: Match the flash block size
[Diagram: a DFlash file laid out over Flash Block 0 and Flash Block 1 (each nppas * PAGE_SIZE bytes). Arena blocks (kArenaBlockSize, optimal size ~1/10 of write_buffer_size), tagged with RocksDB file GIDs, fill each flash block from offset 0; the gap between EOF and the end of the last block is the space amplification over the life of the file.]
• The WAL and MANIFEST are reused in future instances until replaced
- P3: Ensure that the WAL and MANIFEST replacement size fills up most of the last block
• Sstable sizes follow a heuristic: MemTable::ShouldFlushNow()
P1:
- kArenaBlockSize = sizeof(block)
- Conservative heuristic in terms of overallocation
  • A few lost pages is better than allocating a new block
- The flash block size becomes a "static" DB tuning parameter that is used to optimize "dynamic" ones
➡ Optimize RocksDB bottom-up (from the storage backend to the LSM)
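The sizing rule above is simple arithmetic. The sketch below is our own back-of-envelope illustration (the numbers are assumptions, not from the talk): a flash block holds nppas pages of PAGE_SIZE bytes, and the cost of a file that does not fill its last block is the erased-but-unusable pages at the tail:

```c
/* Back-of-envelope sketch of the P1/P3 sizing rule: align allocation units
 * to the flash block so a DFlash file wastes at most the tail of its last
 * block. Assumed geometry, not from the talk. */
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Flash block capacity in bytes: nppas physical pages of PAGE_SIZE each. */
size_t flash_block_bytes(size_t nppas)
{
    return nppas * PAGE_SIZE;
}

/* Pages lost to padding when file_bytes are laid out over whole blocks. */
size_t lost_pages(size_t file_bytes, size_t nppas)
{
    size_t block = flash_block_bytes(nppas);
    size_t tail = file_bytes % block;          /* bytes used in the last block */
    if (tail == 0)
        return 0;
    size_t used_pages = (tail + PAGE_SIZE - 1) / PAGE_SIZE;
    return nppas - used_pages;                 /* erased but unusable pages */
}
```

For example, with 256-page (1 MiB) blocks, a flush of exactly 1 MiB loses nothing, while a flush of 1 MiB plus one byte claims a second block and leaves 255 pages idle. This is the overallocation the conservative kArenaBlockSize heuristic tries to keep small.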
17. P2, P4, P6: Block Metadata
[Diagram: a flash block with block starting metadata in its first valid page, RocksDB data in the intermediate pages, and block ending metadata in the last page; each page also carries out-of-band (OOB) metadata.]
• Blocks can be checked for integrity
• A new DB instance can append; padding is maintained in the OOB area (P6)
• Closing a block updates bad page & bad block information (P6)

struct vblock_init_meta {
    char filename[100];    // RocksDB file GID
    uint64_t owner_id;     // Application owning the block
    size_t pos;            // Relative position in the block
};

struct vpage_meta {
    size_t valid_bytes;    // Valid bytes from offset 0
    uint8_t flags;         // State of the page
};

struct vblock_close_meta {
    size_t written_bytes;  // Payload size
    size_t ppa_bitmap;     // Updated valid page bitmap
    size_t crc;            // CRC of the whole block
    unsigned long next_id; // Next block ID (0 if last)
    uint8_t flags;         // Vblock flags
};
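To illustrate how the closing metadata enables the integrity check mentioned above, here is a sketch of our own (the check function and the toy checksum are our assumptions, not the talk's code; the struct is trimmed to the fields the check needs): on recovery, recompute a checksum over the block payload and compare it against the stored crc.

```c
/* Illustrative sketch (our assumption, not DFlash code) of validating a
 * closed block on recovery using its closing metadata. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed version of vblock_close_meta with only the fields used here. */
struct vblock_close_meta {
    size_t written_bytes;  /* payload size */
    size_t crc;            /* checksum of the whole block payload */
    unsigned long next_id; /* next block ID (0 if last) */
};

/* Toy FNV-1a-style checksum standing in for the real CRC. */
size_t block_crc(const uint8_t *payload, size_t len)
{
    size_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = (h ^ payload[i]) * 16777619u;
    return h;
}

/* A block is recoverable if its payload still matches the closing metadata. */
int block_is_valid(const uint8_t *payload, const struct vblock_close_meta *m)
{
    return block_crc(payload, m->written_bytes) == m->crc;
}
```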
18. P2, P4, P6: Crash Recovery
• A DFlash file can be reconstructed from individual blocks (P2, P4)
1. Metadata for the blocks forming a DFlash file is stored in the MANIFEST
   • The last WAL is not guaranteed to reach the MANIFEST -> RECOVERY metadata for DFlash
2. On recovery, LightNVM provides an application with all of its valid blocks
3. Each block stores enough metadata to reconstruct a DFlash file
[Diagram: the MANIFEST is a sequence of (OPCODE, encoded metadata) records typed as Log, Current, Metadata, or Sstable. Private (Env) records hold enough metadata to recover the database in a new instance; private (DFlash) records list the vblocks forming each DFlash file. On recovery, the block manager on the Open-Channel SSD returns the block list for the application's owner ID.]
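Step 3 of the recovery above can be sketched as a chaining pass. This is our own illustration under assumed shapes (the real code lives in the DFlash backend): given the pool of valid blocks returned by the block manager and the first block ID recorded in the MANIFEST, follow each block's next_id to rebuild the file order.

```c
/* Sketch (our illustration) of reconstructing a DFlash file from recovered
 * blocks by following the next_id chain in each block's closing metadata. */
#include <assert.h>
#include <stddef.h>

struct block {
    unsigned long id;       /* block ID handed out by the block manager */
    unsigned long next_id;  /* next block in the file, 0 if last */
};

/* Order the recovered blocks of one file, starting from first_id (taken
 * from the MANIFEST). Returns the number of chained blocks written to out. */
size_t chain_file(const struct block *pool, size_t n,
                  unsigned long first_id, const struct block **out)
{
    size_t len = 0;
    unsigned long want = first_id;
    while (want != 0 && len < n) {
        const struct block *found = NULL;
        for (size_t i = 0; i < n; i++)
            if (pool[i].id == want)
                found = &pool[i];
        if (!found)
            break;              /* chain broken: file cannot be reconstructed */
        out[len++] = found;
        want = found->next_id;
    }
    return len;
}
```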
19. P5: Superblock
• CURRENT is used to store the RocksDB "superblock"
- It points to the current MANIFEST, which is used to reconstruct the DB when creating a new instance. We append the block metadata that points to the blocks forming the current MANIFEST (P5)
[Diagram: CURRENT (Posix) points to the MANIFEST (DFlash), which is read both in normal operation and on recovery; the MANIFEST holds the same (OPCODE, encoded metadata) records as on slide 18.]
21. RocksDB + DFlash: Prototype (1/2)
• Optimize RocksDB for flash storage
- Implement a user-space append-only FS that deals with flash constraints
  • Append-only: updates are re-written and old data is invalidated -> the LSM understands this logic
  • Page cache implemented in user space; use direct I/O
  • Only "sync" complete pages, and prefer closed blocks
  • In case of a write failure, write to a new block (or mark the bad page and retry)
- Implement RocksDB's file classes for DFlash:
  • WritableFile(): sequential writes -> the only way to write to secondary storage
  • SequentialFile(): used primarily for sstable user data and recovery
  • RandomAccessFile(): used primarily for metadata (e.g., CRC checks)
• Use the flash block as the central piece for storage optimizations
- The Open-Channel SSD fabric is configured first
  • Define the block size across LUNs and channels to exploit parallelism
  • Define different types of LUNs with different block features
- RocksDB is configured with standard parameters (e.g., write buffer, cache)
  • The DFlash backend tunes these parameters based on the type of LUN and block
22. RocksDB + DFlash: Prototype (2/2)
• Use LSM merging strategies as perfect garbage collection (GC)
- All blocks in a DFlash file are either active or inactive -> no need for GC in the SSD
- Reduce over-provisioning significantly (~5%)
- Predictable latency -> the SSD is in a stable state from the beginning
• Reuse RocksDB concepts and abstractions as much as possible
- Store private metadata in the MANIFEST
- Store the superblock in CURRENT
- Minimize the amount of "visible" metadata: use the OOB area, the root FS, etc.
• Separate persistent (meta)data into "fast" and "static"
- Fast data is all user data (i.e., sstables) and the WAL
- Fast metadata follows user data rates (i.e., MANIFEST)
- Static metadata is written once and seldom updated
  • CURRENT: superblock for the MANIFEST
  • LOCK and IDENTITY
25. Architecture: DFlash + LightNVM
• The LSM is the FTL
- The DFlash target and the RocksDB storage backend take care of provisioning flash blocks
• Optimized critical I/O path
- Sstables, the WAL, and the MANIFEST are stored on the Open-Channel SSD, where we can provide QoS
• Enables a RocksDB distributed architecture
- The BM abstracts the storage fabric (e.g., NVM) and can potentially provide blocks from different drives -> a single address space
[Architecture diagram: in user space, Env DFlash (DFWritableFile, DFSequentialFile, DFRandomAccessFile) performs sector calculations and sends sstable, WAL, and MANIFEST I/O through a block device, with data placement controlled by the LSM; a provisioning interface (get_block()/put_block()) reaches the DFlash target in kernel space, whose block manager (BM) tracks free, used, and bad blocks on the Open-Channel SSD. Env Posix (PxWritableFile, PxSequentialFile, PxRandomAccessFile) keeps CURRENT, LOCK, IDENTITY, and LOG (Info) on a file system on a traditional SSD. Device features feed the Env options, optimizations, and init code.]
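The provisioning interface in the diagram reduces to two calls. The sketch below is an assumed shape of our own (the real get_block()/put_block() live in the DFlash target and take LUN/geometry arguments): the storage backend pulls free blocks from the block manager and returns them when the LSM deletes a DFlash file, so placement decisions stay with the application, not the device.

```c
/* Toy sketch (assumed shapes, not the LightNVM interface) of the
 * get_block()/put_block() provisioning loop between the storage backend
 * and the block manager. */
#include <assert.h>

#define NBLOCKS 8

enum blk_state { BLK_FREE, BLK_USED, BLK_BAD };

struct block_manager {
    enum blk_state state[NBLOCKS];
};

/* Provision a free block; returns its ID, or -1 if the pool is exhausted.
 * Bad blocks are never handed out. */
int get_block(struct block_manager *bm)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (bm->state[i] == BLK_FREE) {
            bm->state[i] = BLK_USED;
            return i;
        }
    }
    return -1;
}

/* Return a block to the free pool (e.g., the LSM merged away its file). */
void put_block(struct block_manager *bm, int id)
{
    assert(bm->state[id] == BLK_USED);
    bm->state[id] = BLK_FREE;
}
```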
26. QEMU Evaluation: Writes
RocksDB make release; entry keys written with 4 threads (bare metal: ~180MB/s)

WRITES         | DFLASH (1 LUN) | POSIX
10000 keys     | 70MB/s         | 25MB/s
100000 keys    | 40MB/s         | 25MB/s
1000000 keys   | 25MB/s         | 20MB/s

Page-aligned write buffer:
- The DFlash write page cache + direct I/O + a flash-page-aligned write buffer is better optimized than best-effort techniques and top-down optimizations (RocksDB parameters). The write buffer is required by RocksDB due to small WAL writes.
- If we manually tune buffer sizes with Posix, we obtain similar results. However, it requires a lot of experimentation for each configuration.
27. QEMU Evaluation: Reads
RocksDB make release

READS          | DFLASH (1 LUN) | POSIX
10000 keys     | 5MB/s          | 300MB/s
100000 keys    | 5MB/s          | 500MB/s
1000000 keys   | 5MB/s          | 570MB/s

No page cache support:
- Without a DFlash page cache we need to issue an I/O for each read!
  • Sequential 20-byte reads within the same page would each issue a separate PAGE_SIZE I/O
28. QEMU Evaluation: "Fixing" Reads
RocksDB make release

READS          | DFLASH (1 LUN) | DFLASH (1 LUN + simple page cache) | POSIX
10000 keys     | 5MB/s          | 160MB/s                            | 300MB/s
100000 keys    | 5MB/s          | 280MB/s                            | 500MB/s
1000000 keys   | 5MB/s          | 300MB/s                            | 570MB/s

Simple page cache for reads:
- Posix + buffered I/O using Linux's page cache is still better, but we have confirmed our hypothesis.
➡ A user-space page cache is a necessary optimization when the generic OS page cache is bypassed. Other databases use this technique (e.g., Oracle, MySQL)
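The "simple page cache" fix above can be illustrated with a one-page cache. This is our own minimal sketch (not DFlash code; the device read is faked): caching the last flash page read lets sequential small reads in the same page stop issuing one PAGE_SIZE I/O each, which is exactly the pathology in the previous table.

```c
/* Minimal one-page read cache sketch (our illustration, not DFlash code):
 * sequential small reads in the same page are served from memory instead
 * of each issuing a PAGE_SIZE direct I/O. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

struct page_cache {
    uint8_t page[PAGE_SIZE];
    size_t cached_pageno;   /* which page is cached */
    int valid;
    int io_count;           /* device reads issued (for the demo) */
};

/* Stand-in for a PAGE_SIZE direct-I/O read from the device. */
void device_read_page(struct page_cache *pc, size_t pageno)
{
    memset(pc->page, (int)(pageno & 0xff), PAGE_SIZE);  /* fake payload */
    pc->cached_pageno = pageno;
    pc->valid = 1;
    pc->io_count++;
}

/* Small read served from the cached page when possible. */
void cached_read(struct page_cache *pc, size_t off, size_t len, uint8_t *dst)
{
    size_t pageno = off / PAGE_SIZE;
    assert(off % PAGE_SIZE + len <= PAGE_SIZE);  /* single-page reads only */
    if (!pc->valid || pc->cached_pageno != pageno)
        device_read_page(pc, pageno);
    memcpy(dst, pc->page + off % PAGE_SIZE, len);
}
```

Three sequential 20-byte reads from the same page cost one device I/O instead of three; a real implementation would of course cache many pages and prefetch at block granularity for DFSequentialFile.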
29. QEMU Evaluation: Insights
• The Posix backend and the DFlash backend (with 1 LUN) should achieve very similar throughput for reads/writes when using the same page cache and write buffer optimizations
• But...
- DFlash allows optimizing buffer and cache sizes based on flash characteristics
- DFlash knows which file class is calling, so we can prefetch for sequential reads (DFSequentialFile) at block granularity
- DFlash is designed to implement a flash-optimized page cache using direct I/O
• If the Open-Channel SSD exposes several LUNs, we can exploit parallelism within DFlash and RocksDB's LSM write/read patterns
- How many LUNs there are, and how their characteristics are organized, is controller specific
31. CNEX WestLake Evaluation
RocksDB make release; entry keys written with 4 threads. We focus on a single I/O stream for the first prototype -> 1 LUN.

                             | WRITES (1 LUN) | READS (1 LUN) | WRITES (8 LUNs) | READS (8 LUNs) | WRITES (64 LUNs) | READS (64 LUNs)
RocksDB DFLASH, 10000 keys   | 21MB/s         | 40MB/s        | X               | X              | X                | X
RocksDB DFLASH, 100000 keys  | 21MB/s         | 40MB/s        | X               | X              | X                | X
RocksDB DFLASH, 1000000 keys | 21MB/s         | 40MB/s        | X               | X              | X                | X
Raw DFLASH (with fio)        | 32MB/s         | 64MB/s        | 190MB/s         | 180MB/s        | 920MB/s          | 1.3GB/s
32. CNEX WestLake Evaluation: Insights
• RocksDB checks sstable integrity on writes (intermittent reads)
- We pay the price of not having an optimized page cache on writes as well
- Reads and writes are mixed in one single LUN
• Ongoing work: exploit parallelism in RocksDB's I/O patterns
[Diagram: each RocksDB I/O stream (active memtable, read-only memtables, WAL, MANIFEST, and sstables SST1...SSTN under merging & compaction) maps its read (R) and write (W) path through DFlash (get_block()/put_block()) onto a virtual LUN; the block manager maps virtual LUNs onto the physical LUNs (LUN0...LUN9) and channels (CH0...CHN) of the Open-Channel SSD. Goals: do not mix reads and writes, a different VLUN per path, different VLUN types, enabling I/O scheduling, and a block pool in DFlash (prefetching).]
33. CNEX WestLake Evaluation: Insights
• Also, on any Open-Channel SSD:
- DFlash will not take a performance hit when the SSD triggers GC; RocksDB does its GC when merging sstables in LSM levels > L0
  • SSD steady state is improved (and reached from the beginning)
  • We achieve predictable latency
[Diagram: IOPS over time stay flat instead of dipping when device-side GC kicks in.]
34. Status and ongoing work
• Status:
- get_block/put_block interface (first iteration) through the DFlash target
- A RocksDB DFlash target that plugs into LightNVM. Source code to test is available (RocksDB, kernel, and QEMU). Working on WestLake upstreaming.
• Ongoing:
- Implement functions to increase the synergy between the LSM and the storage backend (i.e., tune the write buffer based on the block size) -> upstreaming
- Support libaio to enable async I/O in the DFlash storage backend
  • Need to deal with RocksDB design decisions (e.g., Get() assumes sync I/O)
- Exploit device parallelism within RocksDB's internal structures
- Define different types of virtual LUNs and expose them to the application
- Other optimizations: double buffering, aligned memory in the LSM, etc.
- Move RocksDB DFlash's logic to liblightnvm -> an append-only FS for flash
35. Conclusions
• Application Driven Storage
- Demo working on real hardware: RocksDB -> LightNVM -> WestLake-powered SSD
- QEMU support for testing and development
- More Open-Channel SSDs coming soon
• RocksDB
- DFlash dedicated backend -> an append-only FS optimized for flash
- Sets the basis for moving to a distributed architecture while guaranteeing performance constraints (especially in terms of latency)
- Intention to upstream the whole DFlash storage backend
• LightNVM
- Framework to support Open-Channel SSDs in Linux
- DFlash target to support application FTLs
36. Towards Application Driven Storage: Optimizing RocksDB on Open-Channel SSDs with LightNVM
LinuxCon Europe 2015
Questions?
• Open-Channel SSD Project: https://github.com/OpenChannelSSD
• LightNVM: https://github.com/OpenChannelSSD/linux
• RocksDB: https://github.com/OpenChannelSSD/rocksdb
Javier González <javier@cnexlabs.com>