The document discusses Cassandra's storage internals. It describes how Cassandra appends each write to an on-disk commit log for durability and buffers it in in-memory memtables before flushing to immutable SSTables on disk. It also explains how compaction merges SSTables to reclaim space and improve read performance. For reads, Cassandra uses memtables, bloom filters on SSTables, key caches, and row caches to minimize disk I/O. Counters are implemented by coordinating writes across replicas.
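To make that write path concrete, here is a minimal, illustrative Python sketch (invented names and sizes, not Cassandra's actual code): each write is appended to a commit log for durability and buffered in an in-memory memtable, which is flushed as an immutable sorted SSTable when full.

    # Toy model of the LSM-style write path described above.
    commit_log = []                 # append-only; on disk in the real system
    memtable = {}                   # in-memory map: row -> value
    sstables = []                   # immutable sorted runs [(row, value), ...]
    MEMTABLE_LIMIT = 4              # arbitrary flush threshold for the example

    def write(row, value):
        commit_log.append((row, value))      # sequential append for durability
        memtable[row] = value                # fast in-memory update
        if len(memtable) >= MEMTABLE_LIMIT:
            flush()

    def flush():
        sstables.append(sorted(memtable.items()))  # one sequential SSTable write
        memtable.clear()
        commit_log.clear()           # flushed data no longer needs the log
        # Compaction would later merge the sorted runs in `sstables`
        # into fewer, larger files.

    for i in range(6):
        write(f"row{i}", i)
    print(len(sstables), "SSTable(s);", len(memtable), "row(s) still in memtable")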
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce (Cloudera, Inc.)
The strength of an open source project resides entirely in its developer community; a strong democratic culture of participation and hacking makes for a better piece of software. The key requirement is having developers who are not only willing to contribute, but also knowledgeable about the project's internal structure and architecture. This session will introduce developers to the core internal architectural concepts of HBase, not just "what" it does from the outside, but "how" it works internally, and "why" it does things a certain way. We'll walk through key sections of code and discuss key concepts like the MVCC implementation and memstore organization. The goal is to convert serious "HBase Users" into "HBase Developer Users", and give voice to some of the deep knowledge locked in the committers' heads.
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012) - jbellis
The document discusses dealing with JVM limitations in Apache Cassandra. It identifies key pain points like garbage collection and platform-specific code. It then explores specific issues like fragmentation and offers solutions like arena allocation for memtables. The document also advocates for allowing more low-level access in Java to directly address issues like file mapping limitations, in order to gain performance benefits even if it increases complexity or reduces portability.
HBase is an open source, distributed, column-oriented database modeled after Google's Bigtable that runs on top of Hadoop. The presenter discusses HBase's architecture, performance improvements in version 0.20 (including major gains from new file formats and compression), and StumbleUpon's extensive use of HBase, including support for over 9 billion rows in a single table with high import and read speeds.
Since 2013, Yahoo! has been successfully running multi-tenant HBase clusters. Our tenants run applications ranging from real-time processing (e.g. content personalization, ad targeting) to operational warehouses (e.g. advertising, content). Tenants are guaranteed an adequate level of resource isolation and security. This is achieved through the use of open source and in-house developed HBase features such as region server groups, group-based replication, and group-based favored nodes.
Today, with the increase in adoption and new use cases, we are working towards scaling our HBase clusters to support petabytes of data without compromising on performance and operability. A common tradeoff when scaling a cluster to this size is to increase the size of a region, thus avoiding the problem of having too many regions on a cluster. However, large regions negatively affect the performance and operability of a cluster, mainly because region size determines (1) the granularity of load distribution and (2) the amount of write amplification due to compaction. We are therefore working towards enabling an HBase cluster to host at least a million regions.
In this presentation, we will walk through the key features we have implemented as well as share our experiences working on multi-tenancy and scaling the cluster.
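As a back-of-envelope illustration of that tradeoff (the sizes below are assumptions for the arithmetic, not figures from the talk), region count scales inversely with region size:

    # Illustrative only: smaller regions give finer load distribution and
    # less compaction rewrite per region, but push the cluster toward
    # millions of regions.
    PB = 1024 ** 5
    GB = 1024 ** 3

    def regions_needed(cluster_bytes: int, region_bytes: int) -> int:
        return cluster_bytes // region_bytes

    for region_gb in (1, 10, 100):
        n = regions_needed(10 * PB, region_gb * GB)
        print(f"10 PB at {region_gb:>3} GB/region -> {n:,} regions")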
The presentation provides you with the necessary steps to follow when migrating to XtraDB Cluster.
Percona provides an in-depth review of your database and recommends appropriate changes by performing a complete MySQL health check in which we identify inefficiencies, find problems before they occur, and ensure that your MySQL database is in the best condition.
Presentation from the 2016 Austin OpenStack Summit.
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute by deploying or testing CephFS.
With employees based in countries around the globe which provide 24x7 services to MySQL users worldwide, Percona provides enterprise-grade MySQL Support, Consulting, Training, Managed Services, and Server Development services to companies ranging from large organizations, such as Cisco Systems, Alcatel-Lucent, Groupon, and the BBC, to recent startups building MySQL-powered solutions for businesses and consumers.
Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Besides Real-Time Search & Flexible Indexing, DocValues aka. Column Stride Fields is one of the "next generation" features.
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon (lucenerevolution)
See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011
Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Besides Real-Time Search & Flexible Indexing, DocValues aka. Column Stride Fields is one of the "next generation" features. DocValues enable Lucene to efficiently store and retrieve type-safe document/value pairs in a column-stride fashion, either entirely memory-resident with random access or disk-resident and iterator-based, without the need to un-invert fields. Its final goal is to provide an independently updateable per-document storage for scoring, sorting or even filtering. This talk will introduce the current state of development, implementation details, its features and how DocValues have been integrated into Lucene's Codec API for full extendability.
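The column-stride idea can be sketched in a few lines of Python (field names invented; this is an analogy, not Lucene's implementation): one contiguous value array per field, indexed by document id, gives constant-time access for sorting or scoring without un-inverting the field.

    # Row-oriented stored fields: every sort touches whole documents.
    docs = [
        {"title": "lucene", "price": 12.0},
        {"title": "search", "price": 7.5},
        {"title": "index", "price": 30.0},
    ]

    # Column stride: one contiguous array per field, one slot per docid,
    # which can live in memory or on disk.
    price_column = [d["price"] for d in docs]

    def sort_docids_by_price():
        # O(1) value lookup per docid; no field un-inversion required.
        return sorted(range(len(docs)), key=lambda docid: price_column[docid])

    print(sort_docids_by_price())  # [1, 0, 2]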
Salvatore Sanfilippo - How Redis Cluster works, and why - NoSQL matters Barce... (NoSQLmatters)
Salvatore Sanfilippo - How Redis Cluster works, and why
In this talk the algorithmic details of Redis Cluster will be exposed in order to show the design tensions in the clustered version of a high-performance database supporting complex data types, the selected tradeoffs, and their effect on the availability and consistency of the resulting solution. Other non-chosen solutions in the design space will be illustrated for completeness.
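For context, the key-to-node mapping Redis Cluster uses can be reproduced in a short Python sketch: a key hashes to one of 16384 slots via CRC16 (the XMODEM variant, which binascii.crc_hqx with an initial value of 0 should match), and hash tags force related keys into the same slot. Slot ranges, not individual keys, are then assigned to nodes.

    import binascii

    HASH_SLOTS = 16384

    def key_slot(key: bytes) -> int:
        # Redis Cluster hashes only the substring between the first {...}
        # pair, if present and non-empty ("hash tags"), so related keys
        # can be forced onto the same slot.
        start = key.find(b"{")
        if start != -1:
            end = key.find(b"}", start + 1)
            if end != -1 and end != start + 1:
                key = key[start + 1:end]
        # CRC-16/XMODEM modulo 16384, as the Redis Cluster spec describes.
        return binascii.crc_hqx(key, 0) % HASH_SLOTS

    print(key_slot(b"user:1000"))  # some slot in [0, 16383]
    # Hash tags: both keys hash the "user:1000" part, so the slots match.
    print(key_slot(b"{user:1000}.following") == key_slot(b"{user:1000}.followers"))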
This document discusses techniques for improving latency in HBase. It analyzes the write and read paths, identifying sources of latency such as networking, HDFS flushes, garbage collection, and machine failures. For writes, it finds that single puts can achieve millisecond latency while streaming puts can hide latency spikes. For reads, it notes cache hits are sub-millisecond while cache misses and seeks add latency. GC pauses of 25-100ms are common, and failures hurt locality and require cache rebuilding. The document outlines ongoing work to reduce GC, use off-heap memory, improve compactions and caching to further optimize for low latency.
This document summarizes a presentation about PostgreSQL replication. It discusses different replication terms like master/slave and primary/secondary. It also covers replication mechanisms like statement-based and binary replication. The document outlines how to configure and administer replication through files like postgresql.conf and recovery.conf. It discusses managing replication including failover, failback, remastering and replication lag. It also covers synchronous replication and cascading replication setups.
Webinar Slides: Migrating to Galera Cluster - Severalnines
This document discusses considerations for migrating to Galera Cluster replication from MySQL or other database systems. It covers differences in supported features between Galera and MySQL, including storage engines, tables without primary keys, auto-increment handling, and DDL processing. It also addresses multi-master conflicts, long transactions, LOAD DATA processing, and using Galera with MySQL replication. An overview of online migration is provided along with guidance on validating schemas and checking for compatibility prior to migration.
HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2017. It is the biggest and most exciting milestone release from the Apache community after 1.0. HBase 2.0 contains a large number of features that have been a long time in development, including rewritten region assignment, performance improvements (RPC, a rewritten write pipeline, etc.), async clients, a C++ client, off-heaping of the memstore and other buffers, Spark integration, and shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into technical details on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Existing users of HBase/Phoenix as well as operators managing HBase clusters will benefit the most, learning about the new release and its long list of features. We will also briefly cover the earlier 1.x release lines, compatibility and upgrade paths for existing users, and conclude with an outlook on the next level of initiatives for the project.
The Google Chubby lock service for loosely-coupled distributed systems - Romain Jacotin
The Google Chubby lock service presented in 2006 is the inspiration for Apache ZooKeeper: let's take a deep dive into Chubby to better understand ZooKeeper and distributed consensus.
HBase and HDFS: Understanding FileSystem Usage in HBase - enissoz
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help readers understand HBase's interactions with HDFS for tuning IO performance.
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them! - BertrandDrouvot
This document introduces the asm_metrics utility for monitoring Automatic Storage Management (ASM) metrics. The utility provides real-time ASM metrics like reads/writes per second and I/O times. It is customizable, allowing users to view metrics by ASM instance, database instance, diskgroup, or failgroup. The document provides several use cases for how admins can use asm_metrics to monitor I/O performance and balance across various ASM components.
DataStax: Extreme Cassandra Optimization: The Sequel - DataStax Academy
Al has been using Cassandra since version 0.6 and has spent the last few months doing little else but tune Cassandra clusters. In this talk, Al will show how to tune Cassandra for efficient operation using multiple views into system metrics, including OS stats, GC logs, JMX, and cassandra-stress.
This presentation provides an overview of the Dell PowerEdge R730xd server performance results with Red Hat Ceph Storage. It covers the advantages of using Red Hat Ceph Storage on Dell servers with their proven hardware components that provide high scalability, enhanced ROI cost benefits, and support of unstructured data.
Linux performance tuning & stabilization tips (mysqlconf2010) - Yoshinori Matsunobu
This document provides tips for optimizing Linux performance and stability when running MySQL. It discusses managing memory and swap space, including keeping hot application data cached in RAM. Direct I/O is recommended over buffered I/O to fully utilize memory. The document warns against allocating too much memory or disabling swap completely, as this could trigger the out-of-memory killer to crash processes. Backup operations are noted as a potential cause of swapping, and adjusting swappiness is suggested.
The document discusses various techniques for performance tuning and cluster administration in HBase, including garbage collection tuning, use of memstore-local allocation buffers (MSLAB), enabling compression, optimizing splits and compactions through pre-splitting regions, and addressing hotspotting through manual splits. It provides guidance on configuring garbage collection, compression codecs, and approaches for managing splits and compactions to reduce disk I/O loads.
This document summarizes benchmark tests of NoSQL document databases using MongoDB. It compares the performance of MongoDB's MapReduce and Aggregation Framework on single-node and sharded cluster configurations. The tests measured query response times for common aggregation operations, such as counting the most frequently mentioned users or hashtags. The results showed that the Aggregation Framework was roughly 2 times faster than MapReduce. Scaling out to a sharded cluster did not initially improve performance; however, once the data was partitioned across multiple shards in a modest 3-node cluster, performance beat a single node, with query times decreasing as more shards were added, up to an optimal number.
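As an illustration of the kind of aggregation being benchmarked (collection and field names here are hypothetical, not taken from the document), the "most frequently mentioned users" query looks like this with pymongo's Aggregation Framework:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client.twitter_bench  # hypothetical database name

    # Unwind the mentions array, group-and-count, sort, take the top 10.
    top_mentions = db.tweets.aggregate([
        {"$unwind": "$mentions"},
        {"$group": {"_id": "$mentions", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 10},
    ])
    for doc in top_mentions:
        print(doc["_id"], doc["count"])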
This document discusses various goals, techniques, and solutions for replicating PostgreSQL databases. The goals covered are high availability, performance for reads and writes, supporting wide area networks, and handling offline peers. Techniques include master-slave and multi-master replication, proxies, and using standby systems. Specific solutions described are Slony-I, Slony-II, PGCluster, DBMirror, pgpool, WAL replication, Sequoia, DRBD, and shared storage. The document provides an overview of how each solution can help achieve different replication goals.
M|18 How to use MyRocks with MariaDB Server - MariaDB plc
This session summarizes MyRocks, a storage engine for MariaDB that is based on RocksDB. It discusses how MyRocks addresses some of the limitations of InnoDB, such as high write and space amplification. It provides details on installing and using MyRocks, including data loading techniques, tuning considerations, and replication support. Parallel replication is supported, but the highest isolation level is repeatable-read and row-based replication must be used.
This document discusses how Cassandra's storage engine, although optimized for spinning disks, remains well-suited for solid-state drives. It describes how Cassandra uses LSM trees with sequential, append-only writes, avoiding the random write patterns that cause issues for SSDs, such as write amplification and reduced lifetime from excessive garbage collection. While SSDs offer benefits like fast random access, Cassandra's design sidesteps the problems that SSD controllers were built to mitigate, keeping write amplification close to 1 and leveraging SSDs' fast sequential throughput.
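A toy calculation makes the write-amplification contrast concrete; the sizes below are illustrative assumptions, not measurements:

    # Write amplification (WA) = bytes physically written to flash
    # divided by bytes the application actually changed.
    update_size = 100          # bytes changed by one logical write (assumed)
    page_size = 16 * 1024      # an in-place engine rewrites a whole page (assumed)

    in_place_wa = page_size / update_size     # ~164x for this small update
    append_wa = update_size / update_size     # ~1x: the bytes are just appended

    print(f"in-place page rewrite: ~{in_place_wa:.0f}x write amplification")
    print(f"append-only log/SSTable write: ~{append_wa:.0f}x on the write path")
    # Compaction later re-copies data, so end-to-end WA ends up a little
    # above 1, but the I/O stays sequential, which flash handles well.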
Speaker: Vladimir Rodionov (bigbase.org)
This talk introduces a totally new implementation of multilayer caching in HBase called BigBase. BigBase has a big advantage over HBase 0.94/0.96 because of its ability to utilize all available server RAM in the most efficient way, and because of a novel implementation of an L3 cache on fast SSDs. The talk will show that the different types of cache in BigBase work best for different types of workload, and that a combination of these caches (L1/L2/L3) increases the overall performance of HBase by a very wide margin.
The document provides an evaluation report of DaStor, a Cassandra-based data storage and query system. It summarizes the testbed hardware configuration including 9 nodes with 112 cores and 144GB RAM. It also describes the DaStor configuration, data schema for call detail records (CDR), storage architecture with indexing scheme, and benchmark results showing a throughput of around 80,000 write operations per second for the cluster.
Cache memory is used to improve processor performance by making main memory access appear faster. It works based on the principle of locality of reference, where programs tend to access the same data/instructions repeatedly. A cache hit provides faster access than main memory, while a miss requires retrieving data from main memory. Caches use mapping functions like direct, associative, or set-associative mapping to determine where to place blocks of data from main memory.
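As a worked example of the simplest of those mapping functions, here is a direct-mapped cache lookup in Python (block size and line count are arbitrary example values):

    # Direct-mapped placement: each main-memory block maps to exactly one
    # cache line, identified by the (tag, line, offset) split of an address.
    BLOCK_SIZE = 64       # bytes per cache block (assumed)
    NUM_LINES = 256       # number of cache lines (assumed)

    def map_address(addr: int):
        block = addr // BLOCK_SIZE
        offset = addr % BLOCK_SIZE
        line = block % NUM_LINES      # the one line this block may occupy
        tag = block // NUM_LINES      # disambiguates blocks sharing a line
        return tag, line, offset

    cache = {}  # line -> (tag, data)

    def access(addr: int) -> str:
        tag, line, _ = map_address(addr)
        if line in cache and cache[line][0] == tag:
            return "hit"
        cache[line] = (tag, f"block@{addr - addr % BLOCK_SIZE}")  # fetch from memory
        return "miss"

    print(access(0x1234), access(0x1234))  # miss then hit: locality of reference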
This document provides an overview of key-value stores Bigtable and Dynamo. It discusses their data models, APIs, consistency models, replication strategies, and architectures. Bigtable uses a column-oriented data model and provides strong consistency, while Dynamo sacrifices consistency for availability and flexibility through configurable consistency parameters. Both systems were designed for web-scale applications but take different approaches to meet different priorities like writes for Bigtable and availability for Dynamo.
What Every Developer Should Know About Database Scalability - jbellis
Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955
This document discusses shingled magnetic recording (SMR) disks, which allow for higher storage densities by overlapping written tracks. SMR disks can achieve 2-3 times the density of conventional disks but only support random reads and sequential writes. The document explores two strategies for utilizing SMR disks: 1) masking their operational differences behind a translation layer or 2) using a specialized file system optimized for their characteristics. Key challenges include supporting random writes, managing bands of tracks, and reserving space for random access versus storing in non-volatile RAM. Workload analysis is needed to determine suitability for general usage.
This document discusses a distributed database called Acunu that is tunably consistent, highly available, and partition tolerant. It can scale out on commodity servers and provides high performance. The database uses a multi-master architecture without single points of failure and supports data replication across multiple data centers. It also provides a simple but powerful data model and is well-suited for applications involving high-velocity data.
An explanation of how silicon-wafer-toting droids translate a read+write operation on an SSD device. Presented at Sydney VMUG 2012Q2.
Best downloaded and played with Keynote for animations.
This document provides an overview of Redis including:
- Basic data structures like strings, lists, sets, sorted sets, and hashes
- Common commands for each data type
- Internal implementation details like ziplists, dictionaries, and skip lists
- Additional features like pub/sub, transactions, replication, persistence, and virtual memory
- Examples of Redis applications and how to contribute code to the Redis project
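To make those data types concrete from a client's point of view, here is a brief sketch using the redis-py client (assumes redis-py 3.5+ and a server on localhost; key names are invented):

    import redis  # redis-py client; requires a running Redis server

    r = redis.Redis(host="localhost", port=6379, db=0)

    # Strings
    r.set("greeting", "hello")

    # Lists: push/pop from either end
    r.lpush("queue", "job1", "job2")

    # Sets
    r.sadd("tags", "db", "cache")

    # Sorted sets: members ordered by score (skip-list backed internally)
    r.zadd("leaderboard", {"alice": 42, "bob": 17})

    # Hashes: field/value maps under one key
    r.hset("user:1", mapping={"name": "Ada", "lang": "en"})

    print(r.get("greeting"), r.zrange("leaderboard", 0, -1, withscores=True))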
SSDs, IMDGs and All the Rest - Jax London - Uri Cohen
This document discusses how SSDs are improving data processing performance compared to HDDs and memory. It provides numbers showing SSDs have faster access times than HDDs but slower than memory. It also explains some of the challenges of SSDs like limited write cycles and that updates require erasing entire blocks. It discusses how databases like Cassandra and technologies like flash caching are optimized for SSDs, but there is still room for improvement like reducing read path complexity and write amplification. The document advocates for software optimizations to directly access SSDs and reduce overhead to further improve performance.
How to randomly access data in close-to-RAM speeds but a lower cost with SSD'... - JAXLondon2014
This document discusses how SSDs are improving data processing performance compared to HDDs and memory. It outlines the performance differences between various storage levels like registers, caches, RAM, SSDs, and HDDs. It then discusses some of the challenges with SSDs related to their NAND chip architecture and controllers. It provides examples of how databases like Cassandra and MySQL can be optimized for SSD performance characteristics like sequential writes. The document argues that software needs to better utilize direct SSD access and trim commands to maximize performance.
The document discusses various aspects of computer memory systems including main memory, cache memory, and memory mapping techniques. It provides details on:
1) Main memory stores program and data during execution and consists of addressable memory cells. Memory access time is the time for a memory operation while cycle time is the minimum delay between operations.
2) Memory units include RAM, ROM, PROM, EPROM, EEPROM and flash memory which have different characteristics like volatility and ability to be written.
3) Cache memory uses fast SRAM to improve performance by taking advantage of locality of reference where nearby memory accesses are common. Mapping techniques like direct, associative and set-associative mapping determine how
Webinar: Deep Dive on Apache Flink State - Seth Wiesman (Ververica)
Apache Flink, a world-class stateful stream processor, presents a huge variety of optional features and configuration choices to the user. Determining the optimal choice for any production environment and use case can be challenging. In this talk, we will explore and discuss the universe of Flink configuration with respect to state and state backends.
We will start with a closer look under the hood, at core data structures and algorithms, to build the foundation for understanding the impact of tuning parameters and the costs-benefit-tradeoffs that come with certain features and options. In particular, we will focus on state backend choices (Heap vs RocksDB), tuning checkpointing (incremental checkpoints, ...) and recovery (local recovery), serializers and Apache Flink's new state migration capabilities.
Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra... - peknap
Reducing memory usage is well covered in the history of this conference, yet new tricks still do exist. When optimizing memory footprint for a home gateway device, the author found some unexpected places where small changes can save a valuable amount of DRAM or Flash space. This talk will visit different areas, including kernel (fragmentation threshold, the page frame reclamation task, and atomic memory), application level (memory-inefficient shared libraries due to ABI compliance and dynamic loading), toolchain (tuning malloc allocator parameters and compiler options), and system level (a general kernel might be more memory-efficient than MMU-less uClinux, and preventing lock-up when the system is on the brink of running out of memory).
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B... - DataStax Academy
Speaker(s): Kathryn Erickson, Engineering at DataStax
During this session we will discuss varying recommended hardware configurations for DSE. We'll get right to the point and provide quick and solid recommendations up front. After we get the main points down, we'll take a brief tour of the history of database storage and then focus on designing a storage subsystem that won't let you down.
Meta/Facebook's database serving social workloads is running on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depends a lot on RocksDB. Not just MyRocks, but also we have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
BigBase is a read-optimized version of the HBase NoSQL data store and is FULLY, 100% HBase compatible. 100% compatibility means that the upgrade from HBase to BigBase, and the other way around, does not involve data migration and can even be made without stopping the cluster (via rolling restart).
This document proposes a flash-based caching scheme called Flash as Cache Extension (FaCE) to improve database performance and recovery time. FaCE caches both clean and dirty database pages in SSDs on database transaction commit. It uses a write-optimized design with sequential writes to SSDs and a write-back synchronization policy. FaCE also leverages the non-volatility of flash caches to support faster database recovery by reading cached pages from SSDs instead of disks. Evaluation shows FaCE achieves over 3x higher throughput than existing schemes and 4x faster recovery time.
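The flavor of that write-back, sequentially-logged flash cache can be caricatured in a few lines of Python; this is a toy model under invented names, not the paper's implementation:

    ssd_log = []     # sequential, append-only cache area on the SSD
    ssd_index = {}   # page_id -> position of the newest version in ssd_log
    disk = {}        # backing store (HDD)

    def commit(page_id, data):
        # On commit, the dirty page goes to flash sequentially; no disk I/O yet.
        ssd_index[page_id] = len(ssd_log)
        ssd_log.append((page_id, data))

    def read(page_id):
        pos = ssd_index.get(page_id)
        if pos is not None:
            return ssd_log[pos][1]    # served from flash; being non-volatile,
                                      # it can also be replayed on recovery
        return disk.get(page_id)      # cache miss: fall back to disk

    def destage():
        # Write back only the newest version of each page, then recycle the log.
        for page_id, pos in list(ssd_index.items()):
            disk[page_id] = ssd_log[pos][1]
        ssd_log.clear(); ssd_index.clear()

    commit("p1", "v1"); commit("p1", "v2")
    print(read("p1"))   # 'v2' from the flash cache
    destage()
    print(read("p1"))   # 'v2' now from disk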
This document provides an overview of Cassandra's read and write paths. It describes the core components involved, including memtables, SSTables, commitlog, cache service, column family store, and more. It explains how writes are applied to the commitlog and memtable and how reads merge data from memtables and SSTables using the collation controller.
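A stripped-down model of that collation step in Python (illustrative structures, not Cassandra's classes): a read merges the memtable with candidate SSTables and keeps, per column, the value with the newest timestamp.

    memtable = {"row1": {"name": ("Ada", 30)}}            # column -> (value, ts)
    sstables = [
        {"row1": {"name": ("Adah", 10), "city": ("Paris", 10)}},
        {"row1": {"city": ("London", 20)}},
    ]

    def read_row(key):
        merged = {}
        for source in sstables + [memtable]:   # order is irrelevant: newest ts wins
            for col, (val, ts) in source.get(key, {}).items():
                if col not in merged or ts > merged[col][1]:
                    merged[col] = (val, ts)    # last-write-wins per column
        return {col: val for col, (val, _) in merged.items()}

    print(read_row("row1"))  # {'name': 'Ada', 'city': 'London'}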
Similar to Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Acunu and Hailo: a realtime analytics case study on Cassandra - Acunu
Hailo is a taxi app that receives a hail every 4 seconds across 15 cities. It launched on AWS using MySQL but adopted Cassandra and Acunu for greater resilience during international expansion. Cassandra provided high availability and global replication. Acunu provided analytics capabilities on Cassandra data. Hailo uses Cassandra for entity storage and Acunu for analytics, seeing benefits like simplified data modeling, rich queries, and infrastructure monitoring. Choosing these platforms allowed for high availability, multi-data center operation, and scaling to support growth.
- Cassandra nodes are clustered in a ring, with each node assigned a random token range to own.
- Adding or removing nodes traditionally required manually rebalancing the token ranges, which was complex, impacted many nodes, and took the cluster offline.
- Virtual nodes assign each physical node multiple random token ranges of varying sizes, allowing incremental changes where new nodes "steal" ranges from others, distributing the load evenly without manual work or downtime (see the sketch after this list).
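A minimal Python sketch of that vnode scheme, with MD5 standing in for the partitioner hash (as in Cassandra's RandomPartitioner; the token count per node is kept tiny here, where real nodes default to 256):

    import bisect, hashlib

    # Toy vnode ring: each physical node owns several tokens, so a new
    # node "steals" small slices from many owners instead of one big one.
    def token(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    VNODES = 8   # tokens per physical node (illustrative)
    ring = []    # sorted list of (token, node)

    def add_node(node: str):
        for i in range(VNODES):
            bisect.insort(ring, (token(f"{node}#{i}"), node))

    def owner(key: str) -> str:
        toks = [t for t, _ in ring]
        idx = bisect.bisect(toks, token(key)) % len(ring)  # wrap around the ring
        return ring[idx][1]

    for n in ("node-a", "node-b", "node-c"):
        add_node(n)
    print(owner("user:42"))
    # Adding a fourth node would move only roughly a quarter of the keys,
    # taken in small slices from all three existing nodes.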
Acunu Analytics and Cassandra at Hailo - All Your Base 2013 - Acunu
Hailo, the taxi app, has served more than 5 million passengers in 15 cities and has taken fares of $100 million this year. I'm going to talk about how that rapid growth has been powered by a platform based on Cassandra and operational analytics and insights powered by Acunu Analytics. I'll cover some challenges and lessons learned from scaling fast!
Understanding Cassandra internals to solve real-world problems - Acunu
The document summarizes Nicolas Favre-Felix's presentation on Cassandra internals at a Cassandra London meetup. It discusses four common problems encountered with Cassandra - high read latency, high CPU usage with little activity, long nodetool repair times, and optimizing write throughput. For each problem, it describes symptoms, analysis using tools like nodetool, and solutions like adjusting the data model, increasing thread pool sizes, and adding hardware resources. The key takeaways are that monitoring Cassandra is important, using the right data model impacts performance, and understanding how Cassandra stores and arranges data on disk is essential to optimization.
Talk for the Cassandra Seattle Meetup April 2013: http://www.meetup.com/cassandra-seattle/events/114988872/
Cassandra's got some properties which make it an ideal fit for building real-time analytics applications -- but getting from atomic increments to live dashboards and streaming queries is quite a stretch. In this talk, Tim Moreton, CTO at Acunu, talks about how and why they built Acunu Analytics, which adds rich SQL-like queries and a RESTful API on top of Cassandra, and looks at how it keeps Cassandra's spirit of denormalization under the hood.
The document describes how Apache Cassandra can be used for real-time analytics on streaming data. It provides an example of counting Twitter mentions of a term per day in real-time by incrementing counters in Cassandra as tweets are processed. This allows queries to be answered by reading the counters. More complex queries can be supported by storing aggregated data in a denormalized format across rows and columns in Cassandra.
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz... - Acunu
The document discusses implementing real-time analytics on Twitter data using Cassandra. It describes incrementing counters for each tweet to track token frequencies over time. This allows querying token mentions within a date range by reading the relevant counter columns. However, Cassandra's random partitioner prevents efficient range queries on rows. Instead, the solution denormalizes the data into wide rows with time buckets as columns to allow fast counting of token mentions within each time period through a single disk read. The document provides code examples and encourages experimenting with an open source implementation.
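A compact Python model of that denormalized layout (token and bucket names invented): one wide row per token with one counter column per day bucket, so a date-range query reads a contiguous slice of columns instead of scanning raw tweets.

    from collections import defaultdict
    from datetime import datetime

    counters = defaultdict(lambda: defaultdict(int))  # token -> {day: count}

    def ingest(tweet: str, ts: datetime):
        day = ts.strftime("%Y%m%d")
        for tok in set(tweet.lower().split()):
            counters[tok][day] += 1      # in Cassandra: a counter column increment

    def mentions(token: str, start: str, end: str) -> int:
        # In Cassandra this is a single column slice on one wide row.
        row = counters[token.lower()]
        return sum(c for day, c in row.items() if start <= day <= end)

    ingest("cassandra rocks", datetime(2012, 10, 1))
    ingest("cassandra at scale", datetime(2012, 10, 2))
    print(mentions("cassandra", "20121001", "20121002"))  # 2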
This document discusses real-time analytics with Cassandra. It includes sections on motivation/alternatives, what real-time analytics with Cassandra is, how it works, approximate analytics, and what problems it can help solve. The document contains log data as an example of the type of data that can be analyzed with this technique.
- The document discusses Acunu Analytics, a real-time big data analytics platform.
- It addresses the motivation for developing Acunu Analytics compared to alternatives. It also briefly describes what Acunu Analytics is, how it works, and what problems it can help solve.
- The main topics covered are the product itself, its capabilities for real-time analytics of big data, and potential use cases.
Realtime Analytics on the Twitter Firehose with Cassandra - Acunu
This document discusses using Cassandra for real-time analytics of Twitter data. It describes incrementing counters in Cassandra as tweets are processed to track metrics like mentions over time. This allows queries to retrieve trends by reading counters with a single I/O, rather than scanning large amounts of data. The document demonstrates preparing tweet data by tokenizing and incrementing counters in time buckets. It also covers implementing a range query to retrieve mentions between dates from a wide row with time buckets as columns.
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R... - Acunu
The document discusses NoSQL, NewSQL, and other database technologies that are emerging to address limitations of relational databases in scaling to meet demands for performance, availability, and flexibility. It provides an overview of different categories of NoSQL databases and NewSQL solutions, and analyzes drivers like scalability, performance, relaxed consistency, agility, and complexity of data that are contributing to adoption of these new database approaches.
Cassandra EU 2012 - Putting the X Factor into Cassandra - Acunu
Malcolm Box discusses Tellybug's experience using Cassandra to power voting applications for reality TV shows like Britain's Got Talent and The X Factor. They started with Cassandra to handle high write loads from millions of votes but found counting to be more challenging than expected. They implemented sharded counters in Memcached with Cassandra as the source of truth. While Cassandra scaled well for writes, reads had performance issues. Backup and data integrity also presented operational challenges as their usage of Cassandra evolved.
Acunu is developing an enterprise Cassandra appliance called Castle that aims to simplify Cassandra deployment and management. Castle includes a storage engine optimized for large disks and workloads, and allows for high density on commodity hardware. It also features fast disk rebuilds through its shared memory architecture. Acunu provides a web UI called the Control Center to configure, monitor, and troubleshoot Castle without deep Cassandra expertise. Acunu performs extensive automated testing of Castle to ensure reliability.
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans - Acunu
The document discusses the history and development of Cassandra Query Language (CQL), which provides an SQL-like interface for querying Apache Cassandra databases. It describes CQL evolving from versions 1.0 through 3.0 to become more standardized and user-friendly. Key points include CQL initially being introduced in Cassandra 0.8 to replace the low-level Thrift API, its goals of being simple, intuitive, and high performing, and ongoing work to improve its interface stability and driver support across languages.
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam... - Acunu
This document summarizes a presentation about Cassandra's highly available distributed data model. The presentation covers Cassandra's key capabilities of scalability, fault tolerance, tunable consistency, and replication without single points of failure. It discusses Cassandra's use of consistent hashing to partition and place data across nodes, as well as its replication strategies and consistency levels that allow tuning availability versus consistency.
Taking AI to the Next Level in Manufacturing.pdf - ssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
AppSec PNW: Android and iOS Application Security with MobSF - Ajin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
Fueling AI with Great Data with Airbyte Webinar - Zilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Skybuffer SAM4U tool for SAP license adoption - Tatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, SAP's free software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Main news related to the CCS TSI 2023 (2023/1695) - Jakub Marek
An English translation of the presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, which was held in the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
"Choosing proper type of scaling", Olena SyrotaFwdays
Ā
Imagine an IoT processing system that is already quite mature and production-ready, for which client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, we will first analyze scaling approaches and then select the proper ones for our system.
Monitoring and Managing Anomaly Detection on OpenShift.pdf - Tosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors - DianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We'll discuss and demo the benefits of UiPath Apps and connectors, including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
āHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...Edge AI and Vision Alliance
Ā
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/how-axelera-ai-uses-digital-compute-in-memory-to-deliver-fast-and-energy-efficient-computer-vision-a-presentation-from-axelera-ai/
Bram Verhoef, Head of Machine Learning at Axelera AI, presents the āHow Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-efficient Computer Visionā tutorial at the May 2024 Embedded Vision Summit.
As artificial intelligence inference transitions from cloud environments to edge locations, computer vision applications achieve heightened responsiveness, reliability and privacy. This migration, however, introduces the challenge of operating within the stringent confines of resource constraints typical at the edge, including small form factors, low energy budgets and diminished memory and computational capacities. Axelera AI addresses these challenges through an innovative approach of performing digital computations within memory itself. This technique facilitates the realization of high-performance, energy-efficient and cost-effective computer vision capabilities at the thin and thick edge, extending the frontier of what is achievable with current technologies.
In this presentation, Verhoef unveils his companyās pioneering chip technology and demonstrates its capacity to deliver exceptional frames-per-second performance across a range of standard computer vision networks typical of applications in security, surveillance and the industrial sector. This shows that advanced computer vision can be accessible and efficient, even at the very edge of our technological ecosystem.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Ā
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Donāt worry, we can help with all of this!
Weāll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. Weāll provide examples and solutions for those as well. And naturally weāll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
2. What this talk covers
• What happens within a Cassandra node
• How Cassandra reads and writes data
• What compaction is and why we need it
• How counters are stored, modified, and read
4. Why is this important?
• Understand what goes on under the hood
• Understand the reasons for these choices
• Diagnose issues
• Tune Cassandra for performance
• Make your data model efficient
6. A word about hard drives
• Main driver behind Cassandra's storage choices
• The last moving part in a server
• Fast sequential I/O (~150 MB/s)
• Slow random I/O (120–200 IOPS)
7. What SSDs bring
• Fast sequential I/O
• Fast random I/O
• Higher cost
• Limited lifetime (write endurance)
• Performance degradation as the drive fills up
8. Disk usage with B-trees
• Important data structure in relational databases
• In-place overwrites (random I/O)
• log_B(N) random accesses for reads and writes — e.g. roughly 4–5 for a billion keys with a branching factor of 100
9. Disk usage with Cassandra
• Made for spinning disks
• Sequential writes, much less than 1 I/O per insert
• Several layers of cache
• Random reads, approximately 1 I/O per read
• Generally "write-optimised"
13. The Commit Log
• Each write is appended to a log file (see the sketch below)
• Guarantees durability after a crash
• ~1-second window during which acknowledged data is still only in RAM
• Sequential I/O
• A dedicated disk is recommended
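To make the mechanism concrete, here is a minimal sketch of a periodic-sync commit log in Java. The class name `CommitLogSketch` and its shape are illustrative assumptions, not Cassandra's actual implementation: appends go sequentially to the end of the file and are acknowledged immediately, while a background task fsyncs once per second — which is exactly where the 1-second durability window comes from.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative periodic-sync commit log, not Cassandra's actual class.
class CommitLogSketch implements AutoCloseable {
    private final FileChannel channel;
    private final ScheduledExecutorService syncer =
            Executors.newSingleThreadScheduledExecutor();

    CommitLogSketch(Path logFile) throws IOException {
        channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // fsync once per second: after a crash, at most ~1 second of
        // acknowledged writes can still be only in RAM.
        syncer.scheduleAtFixedRate(this::sync, 1, 1, TimeUnit.SECONDS);
    }

    // Sequential append; the caller is acknowledged before the fsync.
    synchronized void append(ByteBuffer serializedMutation) throws IOException {
        channel.write(serializedMutation);
    }

    private void sync() {
        try {
            channel.force(false);
        } catch (IOException ignored) {
            // A real implementation would surface this error.
        }
    }

    @Override
    public void close() throws IOException {
        syncer.shutdown();
        channel.force(false);
        channel.close();
    }
}
```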
14. Memtables
• In-memory key/value data structure
• Implemented with ConcurrentSkipListMap (sketched below)
• One per column family
• Very fast inserts
• Columns are merged in memory for the same key
• Flushed at a certain threshold, into an SSTable
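A minimal sketch of the idea, assuming a simplified column model (`MemtableSketch` and `Column` are illustrative names): a sorted, concurrent map of rows, where inserting the same column twice merges in memory by keeping the newest timestamp.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative memtable sketch: one sorted, concurrent map per column family.
class MemtableSketch {
    record Column(String name, byte[] value, long timestamp) {}

    // Row key -> (column name -> column), both levels kept sorted.
    private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, Column>> rows =
            new ConcurrentSkipListMap<>();

    void put(String rowKey, Column c) {
        rows.computeIfAbsent(rowKey, k -> new ConcurrentSkipListMap<>())
            // Same key, same column: merge in memory, newest timestamp wins.
            .merge(c.name(), c,
                   (oldC, newC) -> newC.timestamp() >= oldC.timestamp() ? newC : oldC);
    }

    // A flush walks rows in key order, producing an already-sorted SSTable.
    Iterable<Map.Entry<String, ConcurrentSkipListMap<String, Column>>> inOrder() {
        return rows.entrySet();
    }
}
```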
16. Dumping a Memtable on disk
[Diagram: in the JVM, a new memtable replaces the full one; on disk, the commit log and the freshly written SSTable]
17. The SSTable
• One file, written sequentially
• Columns are in order, grouped by row
• Immutable once written, no updates!
18. SSTables start piling up!
[Diagram: one memtable in the JVM; on disk, the commit log and an ever-growing collection of SSTables]
19. SSTables
• Can't keep all of them forever
• Need to reclaim disk space
• Reads could touch several SSTables
• Scans touch all of them
• In-memory data structures per SSTable
21. Compaction
• Merges SSTables of similar size together (see the sketch below)
• Removes overwrites and deleted data, based on timestamps
• Improves range query performance
• Major compaction creates a single SSTable
• I/O-intensive operation
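A deliberately simplified sketch of the merge step, assuming two in-memory "SSTables" instead of many on-disk files (`CompactionSketch` and `Cell` are illustrative names): per key, the newest timestamp wins, and keys whose surviving cell is a deletion are purged.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified compaction sketch: merge two sorted "SSTables" into one.
class CompactionSketch {
    record Cell(byte[] value, long timestamp, boolean tombstone) {}

    static SortedMap<String, Cell> merge(SortedMap<String, Cell> older,
                                         SortedMap<String, Cell> newer) {
        SortedMap<String, Cell> merged = new TreeMap<>(older);
        // Overwrites: for each key, the cell with the newest timestamp survives.
        newer.forEach((key, cell) -> merged.merge(key, cell,
                (a, b) -> b.timestamp() >= a.timestamp() ? b : a));
        // Deleted data: purge keys whose surviving cell is a tombstone,
        // which is where disk space actually gets reclaimed.
        merged.values().removeIf(Cell::tombstone);
        return merged;
    }
}
```

Real compaction streams many sorted files instead of holding maps in memory, and only purges tombstones after a grace period, but the merge rule is the same.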
22. Recent improvements
• Pluggable compaction
• Different strategies, chosen per column family
• SSTable compression
• More efficient SSTable merges
24. Reading from Cassandra
• Reading all these SSTables would be very inefficient
• We have to read from memory as much as possible
• Otherwise we need to do 2 things efficiently:
• Find the right SSTable to read from
• Find where in that SSTable to read the data
25. First step for reads
• The Memtable!
• Read the most recent data
• Very fast, no need to touch the disk
26. Off-heap (no GC) Row cache
[Diagram: the row cache sits off-heap next to the in-JVM memtable; on disk, the commit log and SSTable]
27. Row cache
• Stores a whole row in memory (a minimal stand-in is sketched below)
• Off-heap, not subject to Garbage Collection
• Size is configurable per column family
• Last resort before having to read from disk
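As a rough stand-in, here is an on-heap LRU cache sketch; the real row cache lives off-heap, outside the garbage-collected Java heap, and `RowCacheSketch` is an illustrative name, not Cassandra's class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// On-heap LRU stand-in for the row cache; capacity is per column family.
class RowCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    RowCacheSketch(int capacity) {
        super(16, 0.75f, true);   // access-order gives LRU behaviour
        this.capacity = capacity;
    }

    // Evict the least recently used row once the cache is full.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```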
28. Finding the right SSTable
[Diagram: one memtable in the JVM; on disk, the commit log and many SSTables, any of which might hold the key]
29. Bloom filter
• Saved with each SSTable
• Answers "contains(Key) :: boolean" (sketched below)
• Saved on disk but kept in memory
• Probabilistic data structure
• Configurable proportion of false positives
• No false negatives
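A minimal bloom filter sketch with the properties the slide lists, using double hashing to derive the probe positions (`BloomFilterSketch` is an illustrative name): it can answer "maybe present" with a tunable false-positive rate, but never reports a present key as absent.

```java
import java.util.BitSet;

// Minimal bloom filter: k probe positions derived from two hashes.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int k;

    BloomFilterSketch(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    private int probe(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.reverse(h1) | 1;   // second, odd hash
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    // contains(Key): "maybe" when true, definitely absent when false.
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(probe(key, i))) return false;
        return true;
    }
}
```

More bits per key and more probes lower the false-positive proportion, at the cost of memory and CPU.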
30. Bloom filter
[Diagram: a read asks each SSTable's bloom filter "exists(key)?" in memory and gets true/false back, before touching the commit log or SSTables on disk]
31. Reading from an SSTable
• We need to know where in the file our data is saved
• Keys are sorted, why don't we do a binary search?
• Keys are not all the same size
• Jumping around in a file is very slow
• log2(N) random I/O, ~20 for 1 million keys
32. Reading from an SSTable
Let's index key ranges in the SSTable
[Diagram: sampled index entries pointing into the SSTable — key k-128 at position 12098, k-256 at 23445, k-384 at 43678]
33. SSTable index
• Saved with each SSTable
• Stores key ranges and their offsets: [(Key, Offset)] (sketched below)
• Saved on disk but kept in memory
• Avoids searching for a key by scanning the file
• Configurable key interval (default: 128)
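A sketch of the sampled index under these assumptions (`SSTableIndexSketch` is an illustrative name): keep every 128th key with its file offset in a sorted map, and resolve a lookup to the nearest sampled key at or before the target, from which the reader scans forward at most one interval.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of the sampled index: every 128th key (the configurable
// interval) is kept in memory along with its offset in the file.
class SSTableIndexSketch {
    private final NavigableMap<String, Long> sampled = new TreeMap<>();

    void addSample(String key, long offset) {
        sampled.put(key, offset);
    }

    // Offset to start scanning from, or -1 if the key sorts
    // before every sampled key (i.e. it cannot be in this file).
    long scanStart(String key) {
        var entry = sampled.floorEntry(key);
        return entry == null ? -1 : entry.getValue();
    }
}
```

The key cache, a few slides further on, short-circuits even that last interval scan by remembering exact offsets for recently read keys.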
34. SSTable index
[Diagram: in the JVM, the memtable plus each SSTable's bloom filter and index; on disk, the commit log and the SSTable itself]
35. Sometimes not enough
• Storing only sampled key ranges still leaves an interval to scan
• We can do better by storing the exact offset
• This saves approximately one I/O
36. The key cache
[Diagram: a key cache joins the memtable, bloom filters and indexes in the JVM; on disk, the commit log and SSTable]
37. Key cache
• Stores the exact location in the SSTable
• Stored in heap
• Avoids having to scan a whole index interval
• Size is configurable per column family
38–44. The full read path
[Animated diagram, repeated across seven near-identical slides, one step highlighted per slide: 1) the memtable, 2) the off-heap (no GC) row cache, 3) the bloom filter, 4) the key cache, 5) the SSTable index — all in the JVM — then 6) the SSTable on disk, alongside the commit log]
46. Distributed counters
• 64-bit signed integer, replicated in the cluster
• Atomic inc and dec by an arbitrary amount
• Counting with read-inc-write would be inefficient
• Stored differently from regular columns
48. Internal counter data
• List of increments received by the local node
• Summaries (version, sum) sent by the other nodes
• The total value is the sum of all counts
49. Internal counter data (example)
[Diagram: local increments +5, +2, −3; summaries received from other nodes — (version 3, count 5) and (version 5, count 10); a code sketch follows]
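A sketch of one replica's counter state as just described (`CounterStateSketch` and `Summary` are illustrative names): a list of raw local increments, plus the freshest (version, sum) summary seen from each other node, with the total being the sum of everything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of one replica's counter state.
class CounterStateSketch {
    record Summary(long version, long sum) {}

    // Raw increments applied on this node, e.g. +5, +2, -3.
    final List<Long> localIncrements = new ArrayList<>();
    // Freshest (version, sum) summary seen from each other node.
    final Map<String, Summary> remote = new HashMap<>();

    // A summary with a higher version replaces the one we hold.
    void receive(String nodeId, Summary s) {
        remote.merge(nodeId, s,
                (old, fresh) -> fresh.version() > old.version() ? fresh : old);
    }

    long localSum() {
        return localIncrements.stream().mapToLong(Long::longValue).sum();
    }

    // Total value = our own increments + the latest count from every other node.
    long total() {
        return localSum() + remote.values().stream().mapToLong(Summary::sum).sum();
    }
}
```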
51–54. Incrementing a counter (built up one step per slide; a code sketch follows)
• A coordinator node is chosen
• It stores its increment locally: local increments +5 +2 −3 +1
• It reads back the sum of its increments
• It forwards a summary to the other replicas: (version 4, sum 5)
• Replicas update their records: received from coordinator — version 4, count 5
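The coordinator-side flow above, sketched in code against the `CounterStateSketch` from earlier — again an illustrative simplification, with replica communication reduced to direct method calls.

```java
// Coordinator-side increment flow, reusing CounterStateSketch from above.
class CounterIncrementSketch {
    static void increment(CounterStateSketch coordinator, long delta,
                          String coordinatorId, long newVersion,
                          Iterable<CounterStateSketch> otherReplicas) {
        // 1. Store the increment locally.
        coordinator.localIncrements.add(delta);
        // 2. Read back the sum of our own increments (a read on the write path!).
        long sum = coordinator.localSum();
        // 3. Forward a (version, sum) summary; replicas keep the newest version.
        CounterStateSketch.Summary summary =
                new CounterStateSketch.Summary(newVersion, sum);
        for (CounterStateSketch replica : otherReplicas) {
            replica.receive(coordinatorId, summary);
        }
    }
}
```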
55. Reading a counter
• Replicas return their counts and versions
• Including what they know about other nodes
• Only the most recent versions are kept (merge sketched below)
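A sketch of the read-time merge rule (`CounterReadMergeSketch` is an illustrative name): for each node, keep only the highest version among everything the replicas returned, then sum the surviving counts.

```java
import java.util.HashMap;
import java.util.Map;

// Read-time merge sketch: replicas each return per-node (version, sum)
// pairs, including what they know about other nodes; only the most
// recent version per node is kept, and the survivors are summed.
class CounterReadMergeSketch {
    record Summary(long version, long sum) {}

    static long merge(Iterable<Map<String, Summary>> replicaResponses) {
        Map<String, Summary> freshest = new HashMap<>();
        for (Map<String, Summary> response : replicaResponses) {
            response.forEach((node, s) -> freshest.merge(node, s,
                    (a, b) -> b.version() > a.version() ? b : a));
        }
        return freshest.values().stream().mapToLong(Summary::sum).sum();
    }
}
```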
60. Tuning
• Cassandra can't really use large amounts of RAM
• Garbage Collection pauses stop everything
• Compaction has an impact on performance
• Reading from disk is slow
• These limitations restrict the size of each node
61. Recap
• Fast sequential writes
• ~1 I/O for uncached reads, 0 for cached
• Counter increments read on write, regular columns don't
• Know where your time is spent (monitor!)
• Tune accordingly
63. • In-kernel backend
• No Garbage Collection
• No need to plan heavy compactions
• Low and consistent latency
• Full versioning, snapshots
• No degradation with Big Data