Scalable Filesystem Metadata Services with RocksDB


Alluxio maintainer and founding engineer Calvin Jia presents on Scalable Filesystem Metadata Services with RocksDB at the RocksDB meetup at Twitter.


Scalable Filesystem Metadata Services Featuring RocksDB
Calvin Jia - 07/11 RocksDB Meetup

About Me
Calvin Jia
● Release Manager for Alluxio 2.0.0
● Contributor since Tachyon 0.4 (2012)
● Founding Engineer @ Alluxio

Alluxio Overview
• Open source data orchestration
• Commonly used for data analytics such as OLAP on Hadoop
• Deployed at Huya, Two Sigma, Tencent, and many others
• Largest deployments of over 1000 nodes

Agenda
1. Architecture
2. Challenges
3. Solutions

Alluxio Master
• Responsible for storing and serving metadata in Alluxio
• Alluxio Metadata consists of files and blocks
• Main data structure is the Filesystem Tree
• The namespace for files in Alluxio
• Can include mounts of other file system namespaces
• The size of the tree can be very large!

Metadata Storage Challenges
• Storing the raw metadata becomes a problem with a large number of files
• On average, each file takes 1 KB of on-heap storage
• 1 billion files would take 1 TB of heap space!
• A typical JVM runs with < 64 GB of heap space
• GC becomes a big problem when using larger heaps
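The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes) and treats the 64 GB figure as a hard ceiling; both are simplifications for illustration, not measurements from Alluxio.

```python
# Back-of-the-envelope heap sizing using the figures quoted on the slide.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 TB = 10**12 bytes).
BYTES_PER_FILE = 1_000            # ~1 KB of on-heap metadata per file
FILES = 1_000_000_000             # 1 billion files
HEAP_LIMIT_BYTES = 64 * 10**9     # ~64 GB, the typical JVM heap size cited above

total_bytes = BYTES_PER_FILE * FILES
files_that_fit = HEAP_LIMIT_BYTES // BYTES_PER_FILE

print(f"heap needed for 1B files: {total_bytes / 10**12:.0f} TB")
print(f"files that fit in a 64 GB heap: {files_that_fit:,}")
```

Under these assumptions a 64 GB heap holds metadata for roughly 64 million files, well short of the 1 billion target.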

Metadata Serving Challenges
• File operations (e.g. getStatus, create) need to be fast
• On-heap data structures excel in this case
• Operations need to be optimized for high concurrency
• Generally many readers and few writers

Store 1B+ files while serving at high performance

RocksDB
• Embeddable
• Key-Value interface
• LSM-tree based storage (sorted)
• Has a Java API
• Vibrant community

Tiered Metadata Storage = 1 Billion Files
Alluxio Master
Local Disk
RocksDB (Embedded)
● Inode Tree
● Block Map
● Worker Block Locations
On Heap
● Inode Cache
● Mount Table
● Locks

Working with RocksDB
• Abstract the metadata storage layer
• Redesign the data structure representation of the Filesystem Tree
• Each inode is represented by a numerical ID
• Edge table maps <ID, child name> to <ID of child>, e.g. <1, foo> to <2>
• Inode table maps <ID> to <metadata blob of inode>, e.g. <2> to <proto>
• Two-table solution provides good performance for common operations
• One lookup for listing by using prefix scan
• One lookup per path component (path depth) for tree traversal
• Constant number of inserts for updates/deletes/creates
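A minimal sketch of the edge-table layout and prefix scan follows. The slides do not state how Alluxio encodes keys, so the fixed-width big-endian ID encoding and the SortedTable class below are hypothetical: SortedTable is an in-memory stand-in for a sorted key-value table, not the RocksDB API.

```python
import bisect
import struct


def inode_key(inode_id: int) -> bytes:
    """Encode an inode ID as 8 big-endian bytes (assumed encoding)."""
    return struct.pack(">Q", inode_id)


def edge_key(parent_id: int, child_name: str) -> bytes:
    """Encode <parent ID, child name> as fixed-width ID bytes plus the name."""
    return inode_key(parent_id) + child_name.encode("utf-8")


class SortedTable:
    """Hypothetical in-memory stand-in for one sorted key-value table."""

    def __init__(self):
        self._keys = []      # keys kept in sorted order
        self._values = {}    # key -> value

    def put(self, key: bytes, value) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def prefix_scan(self, prefix: bytes):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._values[self._keys[i]]
            i += 1


edge_table = SortedTable()
edge_table.put(edge_key(1, "foo"), 2)
edge_table.put(edge_key(1, "bar"), 3)
edge_table.put(edge_key(2, "baz"), 4)

# List the children of inode 1 with a single prefix scan.
prefix = inode_key(1)
children = [(key[len(prefix):].decode("utf-8"), child_id)
            for key, child_id in edge_table.prefix_scan(prefix)]
print(children)
```

The fixed-width ID matters in this sketch: with a variable-width text encoding, the key prefix for ID 1 would also match children of ID 10, so a prefix scan would return entries from the wrong directory.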

Example RocksDB Operations
• To create a file, /s3/data/june.txt:
• Look up <rootID, s3> in the edge table to get <s3ID>
• Look up <s3ID, data> in the edge table to get <dataID>
• Look up <dataID> in the inode table to get <dataID metadata>
• Update <dataID, dataID metadata> in the inode table
• Put <june.txtID, june.txt metadata> in the inode table
• Put <dataID, june.txt> to <june.txtID> in the edge table
• To list children of /:
• Prefix lookup of <rootID> in the edge table to get all <childID>s
• Look up each <childID> in the inode table to get <child metadata>
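The two operations above can be traced with a small in-memory sketch. Plain dicts stand in for the two tables, and the MetadataStore class, its method names, and the metadata fields are hypothetical illustrations rather than Alluxio code; a real sorted store would replace the key filter in list_children with a prefix scan.

```python
class MetadataStore:
    """Hypothetical two-table store: dicts stand in for the edge and inode tables."""

    ROOT_ID = 0

    def __init__(self):
        self.edges = {}   # (parent ID, child name) -> child ID
        self.inodes = {self.ROOT_ID: {"name": "", "is_dir": True, "last_modified": 0}}
        self._next_id = 1

    def resolve(self, path: str) -> int:
        """Walk the edge table, one lookup per path component."""
        inode_id = self.ROOT_ID
        for name in (c for c in path.split("/") if c):
            inode_id = self.edges[(inode_id, name)]
        return inode_id

    def create(self, path: str, is_dir: bool, now: int) -> int:
        components = [c for c in path.split("/") if c]
        parent_id = self.resolve("/".join(components[:-1]))
        # Read and update the parent's metadata in the inode table.
        parent_meta = dict(self.inodes[parent_id])
        parent_meta["last_modified"] = now
        self.inodes[parent_id] = parent_meta
        # Put the new inode, then the edge that links it to its parent.
        child_id = self._next_id
        self._next_id += 1
        self.inodes[child_id] = {"name": components[-1], "is_dir": is_dir,
                                 "last_modified": now}
        self.edges[(parent_id, components[-1])] = child_id
        return child_id

    def list_children(self, path: str):
        parent_id = self.resolve(path)
        # Stand-in for a prefix scan on <parent ID> in a sorted edge table.
        child_ids = [cid for (pid, _), cid in sorted(self.edges.items())
                     if pid == parent_id]
        return [self.inodes[cid] for cid in child_ids]


store = MetadataStore()
store.create("/s3", is_dir=True, now=1)
store.create("/s3/data", is_dir=True, now=2)
store.create("/s3/data/june.txt", is_dir=False, now=3)

root_children = [m["name"] for m in store.list_children("/")]
data_children = [m["name"] for m in store.list_children("/s3/data")]
print(root_children, data_children)
```

Each create issues a fixed number of writes (one parent update, one inode put, one edge put) regardless of directory size, which matches the constant-insert property claimed for the two-table design.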

Effects of the Inode Cache
• Generally can store up to 10M inodes
• Caching top levels of the Filesystem Tree greatly speeds up read performance
• 20-50% performance loss when addressing a filesystem tree that does not mostly fit into memory
• Writes can be buffered in the cache and are asynchronously flushed to RocksDB
• No requirement for durability - that is handled by the journal
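The write-buffering behaviour can be sketched as a write-back cache. The slides do not describe the eviction policy or flush mechanism, so the LRU policy, the WriteBackInodeCache class, and the explicit flush() call below are assumptions; a dict stands in for RocksDB, and a real system would flush from a background thread rather than on demand.

```python
from collections import OrderedDict


class WriteBackInodeCache:
    """Hypothetical LRU write-back cache in front of a backing key-value store."""

    def __init__(self, backing_store: dict, capacity: int):
        self.backing = backing_store   # stand-in for the on-disk store
        self.capacity = capacity       # must be at least 1
        self.cache = OrderedDict()     # inode ID -> metadata, least recent first
        self.dirty = set()             # IDs written to the cache but not yet flushed

    def get(self, inode_id):
        if inode_id in self.cache:
            self.cache.move_to_end(inode_id)
            return self.cache[inode_id]
        value = self.backing.get(inode_id)   # cache miss: read from the backing store
        if value is not None:
            self._insert(inode_id, value)
        return value

    def put(self, inode_id, metadata) -> None:
        """Buffer the write in memory; the backing store is updated later."""
        self._insert(inode_id, metadata)
        self.dirty.add(inode_id)

    def flush(self) -> None:
        """Write all buffered entries to the backing store."""
        for inode_id in list(self.dirty):
            self.backing[inode_id] = self.cache[inode_id]
        self.dirty.clear()

    def _insert(self, inode_id, value) -> None:
        self.cache[inode_id] = value
        self.cache.move_to_end(inode_id)
        while len(self.cache) > self.capacity:
            old_id, old_value = self.cache.popitem(last=False)
            if old_id in self.dirty:         # never drop an unflushed write
                self.backing[old_id] = old_value
                self.dirty.discard(old_id)


backing = {}
cache = WriteBackInodeCache(backing, capacity=2)
cache.put(1, "inode-1")
cache.put(2, "inode-2")
buffered_only = dict(backing)       # nothing has reached the backing store yet
cache.put(3, "inode-3")             # evicts inode 1, which is written out first
after_eviction = dict(backing)
cache.flush()
after_flush = dict(backing)
print(buffered_only, after_eviction, after_flush)
```

Because durability is handled by the journal, losing buffered entries on a crash is acceptable in this design; the sketch only guarantees that an evicted dirty entry is written out before it leaves memory.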

Additional & Future Work
• Fast startup time through using RocksDB checkpoints
• More sophisticated cache management policies

Conclusion
• RocksDB enables us to leverage off-heap storage
• Scales our raw metadata storage by an order of magnitude, allowing us to address over 1 billion files
• Available in Alluxio 2.0 - Released June 27th 2019!

Questions?
Alluxio Website - https://www.alluxio.io
Alluxio Community Slack Channel - https://www.alluxio.io/slack
Alluxio Office Hours & Webinars - https://www.alluxio.io/events
