This document discusses storing data on disks and in files for database management systems. It covers several key topics:
1) The memory hierarchy from main memory to disks and tapes and why databases must store most data on disk for capacity and cost reasons.
2) Disk drive architecture including how data is stored, read, and written in blocks and the implications for performance like seek times and rotational delays.
3) File structure including heap files, how records and pages are organized on disk, and the impact of layout on performance through aspects like locality of reference.
1. Ch 9. Storing Data: Disks and Files
- Heap File Structure -
Sang-Won Lee
http://icc.skku.ac.kr/~swlee
SKKU VLDB Lab. & SOS
( http://vldb.skku.ac.kr/ )
2. Contents
9.0 Overview
9.1 Memory Hierarchy
9.2 RAID (Redundant Array of Independent Disks)
9.3 Disk Space Management
9.4 Buffer Manager
9.5 Files of Records
9.6 Page Format
9.7 Record Format
3. Memory Hierarchy
Top of the hierarchy: smaller, faster, more expensive, volatile.
Bottom of the hierarchy: bigger, slower, cheaper, non-volatile.
Main memory (RAM) for currently used data.
Disk for the main database (secondary storage).
Tapes for archiving older versions of the data (tertiary storage).
WHY A MEMORY HIERARCHY?
What if an ideal storage device appeared, fast, cheap, large, and non-volatile: PCM, MRAM, FeRAM?
4. Jim Gray’s Storage Latency Analogy: How Far Away is the Data?
Registers: 1 clock tick ~ my head (1 min)
On-chip cache: 2 clock ticks ~ this room
On-board cache: 10 clock ticks ~ this hotel (10 min)
Memory: 100 clock ticks ~ Sacramento (1.5 hr)
Disk: 10^6 clock ticks ~ Pluto (2 years)
Tape/optical robot: 10^9 clock ticks ~ Andromeda (2,000 years)
5. Disks and Files
DBMS stores information on (hard) disks.
− Electronic (CPU, DRAM) vs. mechanical (hard disk)
This has major implications for DBMS design!
− READ: transfer data from disk to main memory (RAM).
− WRITE: transfer data from RAM to disk.
− Both are expensive operations relative to in-memory operations, so they must be planned carefully!
Typical access latency: DRAM ~10 ns; hard disk ~10 ms; SSD ~80 µs to ~10 ms
6. Disks
Secondary storage device of choice.
Main advantage over tapes: random access vs. sequential.
− Tapes also deteriorate over time
Data is stored and retrieved in units of disk blocks or pages.
Unlike RAM, the time to retrieve a disk page varies depending upon its location on disk.
− Thus, the relative placement of pages on disk has a big impact on DB performance!
e.g. allocating the pages of the same table adjacently
− We need to optimize both data placement and access.
e.g. the elevator disk scheduling algorithm
7. Anatomy of a Disk
The platters spin, e.g. at 5400 / 7200 / 15K rpm.
The arm assembly is moved in or out to position a head on a desired track. The tracks under the heads form a cylinder.
Mechanical storage -> low IOPS
Only one head reads/writes at any one time.
− Parallelism degree: 1
Block size is a multiple of sector size.
Update-in-place is a poisoned apple:
− No atomic write
− fsync needed for ordering / durability
8. Accessing a Disk Page
Time to access (read/write) a disk block:
− seek time (moving the arms to position the disk head on the track)
− rotational delay (waiting for the block to rotate under the head)
− transfer time (actually moving data to/from the disk surface)
Seek time and rotational delay dominate.
− Seek time: about 1 to 20 msec
− Rotational delay: from 0 to 10 msec
− Transfer time: about 1 msec per 4 KB page
Key to lowering I/O cost: reduce seek/rotation delays!
− e.g. the disk scheduling algorithms in the OS; Linux’s four I/O schedulers
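As a rough sketch (our own arithmetic, not from the slides), the following C program puts these figures together, assuming a 10 ms average seek and a 7200 rpm spindle, and contrasts a random page read with a sequential one:

#include <stdio.h>

/* Back-of-the-envelope: average time to read one 4 KB page from a
   7200 rpm disk, using the slide's estimates. Assumed figures:
   10 ms average seek, ~1 ms transfer per 4 KB page. */
int main(void) {
    double seek_ms     = 10.0;                /* within the 1-20 ms range   */
    double rpm         = 7200.0;
    double rot_ms      = 0.5 * 60000.0 / rpm; /* half a revolution: 4.17 ms */
    double transfer_ms = 1.0;                 /* ~1 ms per 4 KB page        */

    double random_ms = seek_ms + rot_ms + transfer_ms;
    printf("random 4 KB read  : %.2f ms (~%.0f IOPS)\n", random_ms, 1000.0 / random_ms);
    /* a `next' block on the same track pays neither seek nor rotation */
    printf("sequential read   : %.2f ms per page\n", transfer_ms);
    printf("random/sequential : %.1fx\n", random_ms / transfer_ms);
    return 0;
}

With these numbers a random read costs about 15 ms, i.e. roughly 66 IOPS per spindle, which is why the next slide insists on arranging a file’s pages by `next’.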
9. Arranging Pages on Disk
`Next’ block concept:
− blocks on the same track, followed by
− blocks on the same cylinder, followed by
− blocks on an adjacent cylinder
Blocks in a file should be arranged sequentially on disk (by `next’), to minimize seek and rotational delay.
Disk fragmentation problem
− Is this still problematic in flash storage?
10. Table, Insertions, Heap Files
CREATE TABLE TEST (a int, b int, c varchar2(650));
/* Insert 1M tuples into the TEST table (approximately 664 bytes per tuple) */
BEGIN
  FOR i IN 1..1000000 LOOP
    INSERT INTO TEST (a, b, c) VALUES (i, i, rpad('X', 650, 'X'));
  END LOOP;
END;
/*
  Page = 8 KB
  10 tuples / page
  100,000 pages in total
  TEST table = 800 MB
*/
[Diagram: TEST segment (heap file) laid out as data pages 1, 2, …, i, …, 100,000]
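The sizing comment above is simple arithmetic; here is a small C sketch (ours, not the slides’) that reproduces it. The 10 tuples/page figure is the slide’s: a raw 8192/664 would give 12, but the page header and reserved free space (e.g. Oracle’s PCTFREE) eat the difference.

#include <stdio.h>

int main(void) {
    int  page_bytes  = 8 * 1024;  /* 8 KB page                    */
    int  tuple_bytes = 664;       /* approx. bytes per TEST tuple */
    long num_tuples  = 1000000L;  /* 1M inserted tuples           */

    printf("raw fit    : %d tuples/page\n", page_bytes / tuple_bytes);  /* 12 */

    int  tuples_per_page = 10;    /* slide's figure, after page overheads */
    long pages = num_tuples / tuples_per_page;
    printf("pages      : %ld\n", pages);                     /* 100,000    */
    printf("table size : %ld KB (~800 MB)\n", pages * 8L);   /* 800,000 KB */
    return 0;
}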
11. OLAP vs. OLTP
On-Line Analytical vs. Transactional Processing
SQL> SELECT SUM(b) FROM TEST;

    SUM(B)
----------
5.0000E+11

Execution Plan
---------------------------------------------------------------------------
| Id | Operation          | Name | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|  0 | SELECT STATEMENT   |      |    1 |     5 | 22053   (1)| 00:04:25 |
|  1 |  SORT AGGREGATE    |      |    1 |     5 |            |          |
|  2 |  TABLE ACCESS FULL | TEST | 996K | 4865K | 22053   (1)| 00:04:25 |
---------------------------------------------------------------------------

Statistics
----------------------------------------------------------
179 recursive calls
0 db block gets
100152 consistent gets
100112 physical reads
…
1 rows processed

[Diagram: full scan over the TEST segment (heap file), data pages 1 .. 100,000]
12. OLAP vs. OLTP
SQL> SELECT B FROM TEST WHERE A = 500000;

         B
----------
    500000

Execution Plan
---------------------------------------------------------------------------
| Id | Operation          | Name | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|  0 | SELECT STATEMENT   |      |    1 |     5 | 22053   (1)| 00:04:25 |
|  1 |  SORT AGGREGATE    |      |    1 |     5 |            |          |
|  2 |  TABLE ACCESS FULL | TEST | 996K | 4865K | 22053   (1)| 00:04:25 |
---------------------------------------------------------------------------

Statistics
----------------------------------------------------------
179 recursive calls
0 db block gets
100152 consistent gets
100112 physical reads
…
1 rows processed

Point or Range Query
[Diagram: full scan over the TEST segment (heap file), data pages 1 .. 100,000]
13. OLAP vs. OLTP
CREATE INDEX TEST_A ON TEST(A);

SELECT B /* point query */
FROM TEST
WHERE A = 500000;

SELECT B /* range query */
FROM TEST
WHERE A BETWEEN 50001 AND 50101;

[Diagram: B-tree index on TEST(A). Search key 500000 leads to the leaf entry
(500000, (50000, 10)), i.e. block # 50000, slot # 10, which holds the tuple
(500000, 500000, ’XX…XX’) in the TEST segment (heap file).]

Cost:
Full table scan: 100,000 block accesses
Index: (3~4) index block accesses + data block accesses (derivation sketched below)
− Point query: 1 data block
− Range query: depends on the range
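Where does the “(3~4)” come from? A B-tree of N entries with fanout F is about ceil(log_F N) levels deep. A hedged sketch in C, assuming a fanout of 300 entries per 8 KB index page (real fanouts vary):

#include <math.h>
#include <stdio.h>

int main(void) {
    double n_keys = 1e6;    /* one index entry per TEST tuple          */
    double fanout = 300.0;  /* assumed (key, rid) pairs per index page */

    double height = ceil(log(n_keys) / log(fanout));  /* ceil(log_300 1e6) = 3 */
    printf("B-tree height: ~%.0f index blocks\n", height);
    printf("point lookup : ~%.0f + 1 data block, vs. 100,000 for a full scan\n", height);
    return 0;
}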
14. OLAP vs. OLTP
SQL> SELECT B FROM TEST WHERE A = 500000;

         B
----------
    500000

Execution Plan
--------------------------------------------------------------------------------------
| Id | Operation                   | Name   | Rows | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |        |    1 |    10 |     4   (0)| 00:00:01 |
|  1 |  TABLE ACCESS BY INDEX ROWID| TEST   |    1 |    10 |     4   (0)| 00:00:01 |
|* 2 |   INDEX RANGE SCAN          | TEST_A |    1 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
…
5 consistent gets
4 physical reads
…
1 rows processed

Index-based Table Access
15. OLTP: TPC-A/B/C Benchmark
exec sql begin declare section;
  long Aid, Bid, Tid, delta, Abalance;
exec sql end declare section;

DCApplication()
{
  read input msg;
  exec sql begin work;
  exec sql update accounts set Abalance = Abalance + :delta where Aid = :Aid;
  exec sql select Abalance into :Abalance from accounts where Aid = :Aid;
  exec sql update tellers set Tbalance = Tbalance + :delta where Tid = :Tid;
  exec sql update branches set Bbalance = Bbalance + :delta where Bid = :Bid;
  exec sql insert into history(Tid, Bid, Aid, delta, time) values (:Tid, :Bid, :Aid, :delta, CURRENT);
  send output msg;
  exec sql commit work;
}
From Gray’s Presentation
Transaction = Sequence of Reads and Writes
17. [Figure 1.3, Architecture of a DBMS: SQL commands arrive from Web forms, application front ends, and the SQL interface, issued by unsophisticated users (customers, travel agents, etc.) and by sophisticated users, application programmers, and DB administrators. Per-connection components (parser, optimizer, plan executor, operator evaluator) form the query evaluation engine. Shared components sit below: files and access methods, the buffer manager, and the disk space manager, interacting with the transaction manager, lock manager (concurrency control), and recovery manager. At the bottom, the database itself: index files, data files, and the system catalog.]
18. 9.4 Buffer Management in a DBMS
Data must be in RAM for the DBMS to operate on it!
[Diagram: page requests from higher levels are served from a buffer pool of frames in main memory; pages move between free frames and disk pages in the DB on disk; the choice of frame is dictated by the replacement policy.]
19. When a Page is Requested ...
Buffer pool information table: <frame#, pageid, pin_cnt, dirty>
− In big systems, it is not trivial just to check whether a page is in the pool
If the requested page is not in the pool:
− Choose a frame for replacement
− If the frame is dirty, write it to disk
− Read the requested page into the chosen frame
Pin the page and return its address (sketched in code below).
If requests can be predicted (e.g., sequential scans), pages can be pre-fetched several pages at a time!
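A minimal C sketch of this request path (illustrative only, not the book’s or any real DBMS’s code); pick_victim() stands in for the replacement policies discussed below, and all names are ours:

#include <stdbool.h>
#include <stdio.h>

#define NFRAMES 4
#define PAGESZ  8192

typedef struct {
    int  pageid;   /* -1 = free frame                         */
    int  pin_cnt;  /* candidate for replacement iff 0         */
    bool dirty;    /* must be written back before frame reuse */
    char data[PAGESZ];
} Frame;

static Frame pool[NFRAMES] = {{-1}, {-1}, {-1}, {-1}};

/* toy disk layer: stand-ins for real block I/O */
static void read_page(int pageid, char *buf)        { printf("read  page %d\n", pageid); (void)buf; }
static void write_page(int pageid, const char *buf) { printf("write page %d\n", pageid); (void)buf; }

/* trivial policy: first unpinned frame (swap in LRU, clock, ...) */
static int pick_victim(void) {
    for (int i = 0; i < NFRAMES; i++)
        if (pool[i].pin_cnt == 0) return i;
    return -1;  /* all pages pinned: caller must handle this */
}

char *fetch_page(int pageid) {
    for (int i = 0; i < NFRAMES; i++)       /* 1. page already in pool?   */
        if (pool[i].pageid == pageid) {
            pool[i].pin_cnt++;              /*    pin it and return       */
            return pool[i].data;
        }
    int v = pick_victim();                  /* 2. choose a frame          */
    if (v < 0) return NULL;
    if (pool[v].dirty)                      /* 3. write back if dirty     */
        write_page(pool[v].pageid, pool[v].data);
    read_page(pageid, pool[v].data);        /* 4. read the requested page */
    pool[v].pageid  = pageid;
    pool[v].pin_cnt = 1;                    /* 5. pinned; caller unpins   */
    pool[v].dirty   = false;
    return pool[v].data;
}

void unpin_page(int pageid, bool modified) {
    for (int i = 0; i < NFRAMES; i++)
        if (pool[i].pageid == pageid) {
            pool[i].pin_cnt--;
            pool[i].dirty |= modified;      /* dirty bit set by requestor */
            return;
        }
}

int main(void) {
    char *p = fetch_page(7);   /* miss: read page 7, pin it */
    unpin_page(7, true);       /* unpin, marked dirty       */
    p = fetch_page(7);         /* hit: no I/O this time     */
    unpin_page(7, false);
    (void)p;
    return 0;
}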
20. More on Buffer Management
The requestor of a page must unpin it, and indicate whether the page has been modified:
− a dirty bit is used for this.
A page in the pool may be requested many times, so
− a pin count is used.
− a page is a candidate for replacement iff its pin count = 0.
CC & recovery may entail additional I/O when a frame is chosen for replacement (e.g. the Write-Ahead Log protocol).
22. Buffer Replacement Policy
Hit vs. miss
Hit ratio = # of hits / # of page requests to the buffer cache
− One miss incurs one (or two) physical IOs; a hit saves IO.
− Rule of thumb: at least 80 ~ 90%
Problem: for the given (future) references, which victim should be chosen for the highest hit ratio (i.e. the least # of IOs)?
− Numerous policies
− Does one policy win over the others?
− One policy does not fit all reference patterns!
23. Buffer Replacement Policy
The frame is chosen for replacement by a replacement policy:
− Random, FIFO, LRU, MRU, LFU, Clock, etc.
− The replacement policy can have a big impact on the # of I/Os; it depends on the access pattern
For a given workload, one replacement policy, A, achieves a 90% hit ratio and the other, B, 91%.
− How much improvement? 1% or 10%?
− We need to interpret the impact in terms of miss ratio, not hit ratio: misses drop from 10% to 9% of requests, so B performs 10% fewer physical IOs than A.
24. Buffer Replacement Policy (2)
Least Recently Used (LRU)
− For each page in the buffer pool, keep track of the time it was last unpinned
− Replace the frame that has the oldest (earliest) time
− Very common policy: intuitive and simple
− Why does it work?
“Principle of (temporal) locality” of references (https://en.wikipedia.org/wiki/Locality_of_reference)
Why temporal locality in databases?
− The correct implementation is not trivial
Especially in large-scale systems: e.g. maintaining time stamps
Variants
− Linked list of buffer frames, LRU-K, 2Q, midpoint insertion and the touch-count algorithm (Oracle), Clock, ARC …
− Implication of big memory: “random” > “LRU”??
25. LRU and Sequential Flooding
Problem of LRU: sequential flooding
− caused by LRU + repeated sequential scans.
− # buffer frames < # pages in file means each page request causes an I/O. MRU is much better in this situation (but not in all situations, of course).
[Diagram: file A has 5 blocks (1 2 3 4 5); the buffer pool holds 4 blocks (1 2 3 4). Assume repeated sequential scans of file A. What happens when reading the 5th block? And when reading the 1st block again? With LRU, block 5 evicts block 1; re-reading block 1 then evicts block 2, and so on, so every request misses. See the simulation sketch below.]
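A small self-contained simulation of exactly this setup (ours, illustrative): 4 frames, a 5-page file, 10 repeated sequential scans, comparing LRU and MRU by hit count:

#include <stdio.h>
#include <string.h>

#define NFRAMES 4   /* buffer pool size (blocks) */
#define NPAGES  5   /* blocks in file A          */
#define NSCANS  10  /* repeated sequential scans */

static int run(int use_mru) {
    int  frames[NFRAMES];  /* page held by each frame, -1 = free */
    long stamp[NFRAMES];   /* last-use time per frame            */
    memset(frames, -1, sizeof frames);
    memset(stamp, 0, sizeof stamp);
    long tick = 0;
    int  hits = 0;

    for (int s = 0; s < NSCANS; s++)
        for (int p = 0; p < NPAGES; p++) {
            tick++;
            int found = -1, freef = -1, victim = 0;
            for (int i = 0; i < NFRAMES; i++) {
                if (frames[i] == p) found = i;
                if (frames[i] < 0 && freef < 0) freef = i;
                if (use_mru ? stamp[i] > stamp[victim]   /* MRU: evict newest */
                            : stamp[i] < stamp[victim])  /* LRU: evict oldest */
                    victim = i;
            }
            if (found >= 0) { hits++; stamp[found] = tick; }
            else {
                int v = (freef >= 0) ? freef : victim;   /* free frames first */
                frames[v] = p;
                stamp[v]  = tick;
            }
        }
    return hits;
}

int main(void) {
    printf("LRU: %d hits of %d requests\n", run(0), NSCANS * NPAGES);  /* 0 hits */
    printf("MRU: %d hits of %d requests\n", run(1), NSCANS * NPAGES);
    return 0;
}

Under LRU every one of the 50 requests misses, exactly the flooding effect described above; MRU keeps most of the pool stable across scans and hits on the majority of requests.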
26. “Clock” Replacement Policy
An approximation of LRU
Arrange the frames into a cycle; store one reference bit per frame
When the pin count drops to 0, turn on the reference bit
When replacement is necessary:
do for each page in the cycle {
  if (pincount == 0 && ref bit is on)
    turn off ref bit;          /* second chance */
  else if (pincount == 0 && ref bit is off)
    choose this page for replacement;
} until a page is chosen;
Second-chance algorithm / bit
Generalized Clock (GCLOCK): ref. bit -> ref. counter
[Diagram: clock hand sweeping frames A(1), B(p), C(1), D(1); “1” = reference bit set, “p” = pinned]
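The pseudocode above, fleshed out as a tiny runnable C sketch (illustrative only); the frame states mirror the figure, with B pinned:

#include <stdio.h>

#define NFRAMES 4

typedef struct { int pin_cnt; int ref_bit; char name; } Frame;

static Frame frames[NFRAMES] = {
    {0, 1, 'A'}, {1, 1, 'B'}, {0, 1, 'C'}, {0, 1, 'D'}  /* B is pinned */
};
static int hand = 0;  /* clock hand position */

int choose_victim(void) {
    for (;;) {
        Frame *f = &frames[hand];
        if (f->pin_cnt == 0 && f->ref_bit)  /* give a second chance        */
            f->ref_bit = 0;
        else if (f->pin_cnt == 0)           /* unpinned, bit off -> victim */
            return hand;
        hand = (hand + 1) % NFRAMES;        /* advance the hand            */
    }
}

int main(void) {
    int v = choose_victim();
    printf("victim frame: %c\n", frames[v].name);  /* A: first to lose its bit */
    return 0;
}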
28. Why Not Store It All in Main Memory?
Cost! $20 / 1 GB of DRAM vs. $50 / 150 GB of disk (EIDE/ATA) vs. $100 / 30 GB (SCSI).
− High-end databases today are in the 10-100 TB range.
− Approx. 60% of the cost of a production system is in the disks.
Some specialized systems (e.g. Main Memory (MM) DBMSs) store the entire database in main memory.
− Vendors claim a 10x speed-up vs. a traditional DBMS run in main memory.
− SAP HANA, MS Hekaton, Oracle In-Memory, Altibase ..
Main memory is volatile: data should be saved between runs.
− Disk writes are inevitable: log writes for recovery and periodic checkpoints
29. [Diagram: TPS (transactions per second) as a function of CPU (dual, quad, …), buffer size / hit ratio / data size, storage IOPS, and multi-threading with context switching (CPU-IO overlapping)]
3 states: CPU bound, IO bound, balanced
For perfect CPU-IO overlapping, IOPS matters!
30. IOPS Crisis in OLTP
IBM’s TPC-C result (Dec. 2008):
6M tpmC
Total cost: 35M $
− Server HW: 12M $
− Server SW: 2M $
− Storage: 20M $
− Client HW/SW: 1M $
They are buying IOPS, not capacity
31. IOPS Crisis in OLTP (2)
To stay balanced, OLTP systems pay huge $$$ on disks for high IOPS (Amdahl’s law):
[Diagram, a three-stage progression:
1. A balanced state: CPU + server at 300 GIPS (12M $), matched by 10,000 disks for IOPS (20M $).
2. 18 months later (Moore’s law): CPU + server at 600 GIPS (12M $), still 10,000 disks. A balanced state??? 50% CPU utilization, same TPS.
3. Rebalanced: 20,000 disks (short stroking) at 40M $, for a balanced state and 2 X TPS.]
32. Some Techniques to Hide IO Bottlenecks
Pre-fetching: for a sequential scan, pre-fetching several pages at a time is a big win! Even CPU caches and disk controllers support prefetching.
Caching: modern disk controllers do their own caching.
IO overlapping: the CPU works while IO is in progress
− Double buffering, asynchronous IO
Multiple threads
And above all: don’t do IOs, avoid IOs
33. MMDBMS vs. All-Flash DBMS: Personal Thoughts
Why have MMDBMSs recently become popular?
− SAP HANA, MS Hekaton, Oracle In-Memory, Altibase, ….
− Disk IOPS was too expensive
− The price of DRAM has kept dropping: DRAM_Cost << Disk_IOPS_Cost
− The overhead of a disk-based DBMS is not negligible
− Applications with extreme performance requirements
Flash storage
− Low $/IOPS
− DRAM_Cost > Flash_IOPS_Cost
MMDBMS vs. All-Flash DBMS (with some optimizations)
− Winner? Time will tell
34. Power Consumption Issue in Big Memory
Why exponential?
1 kWh = 15 ~ 50 cents; keeping 1 kW powered for a year (8,760 kWh) costs ≈ $1,752 at 20 cents/kWh
Source: SIGMOD ’17 keynote by Anastasia Ailamaki
35. Jim Gray, “Flash Disk Opportunity for Server Applications” (MS-TR 2007)
“My tests and those of several others suggest that FLASH disks can deliver about 3K random 8KB reads/second and with
some re-engineering about 1,100 random 8KB writes per second. Indeed, it appears that a single FLASH chip could deliver
nearly that performance and there are many chips inside the “box” – so the actual limit could be 4x or more. But, even the
current performance would be VERY attractive for many enterprise applications. For example, in the TPC-C benchmark, has
approximately equal reads and writes. Using the graphs above, and doing a weighted average of the 4-deep 8 KB random
read rate (2,804 IOps), and 4-deep 8 KB sequential write rate (1233 IOps) gives harmonic average of 1713 (1-deep gives
1,624 IOps). TPC-C systems are configured with ~50 disks per cpu. For example the most recent Dell TPC-C system has
ninety 15Krpm 36GB SCSI disks costing 45k$ (with 10k$ extra for maintenance that gets “discounted”). Those disks are
68% of the system cost. They deliver about 18,000 IO/s. That is comparable to the requests/second of ten FLASH disks. So
we could replace those 90 disks with ten NSSD if the data would fit on 320GB (it does not). That would save a lot of money
and a lot of power (1.3Kw of power and 1.3Kw of cooling).” (excerpts from “Flash disk opportunity for server-applications”)
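The quote’s “harmonic average of 1713” is easy to verify; a quick sketch of ours:

#include <stdio.h>

/* Harmonic mean of the quote's 4-deep random read rate (2,804 IOps)
   and 4-deep sequential write rate (1,233 IOps). */
int main(void) {
    double reads = 2804.0, writes = 1233.0;
    double hmean = 2.0 / (1.0 / reads + 1.0 / writes);
    printf("harmonic mean: %.0f IOps\n", hmean);  /* ~1713, as quoted */
    return 0;
}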
36. Flash SSD vs. HDD: Non-Volatile Storage
[Diagram: the 60-year champion (HDD) vs. a new challenger (flash SSD), behind an identical block interface]
Flash SSD: electronic; asymmetric read/write; no overwrite
HDD: mechanical; symmetric read/write; overwrite (update in place)
38. Storage Device Metrics
Capacity ($/GB): hard disk >> flash SSD
Bandwidth (MB/sec): hard disk < flash SSD
Latency (IOPS): hard disk << flash SSD
− e.g. hard disks:
Commodity HDD (7200 rpm): 50$ / 1 TB / 100 MB/s / 100 IOPS
Enterprise HDD (15K rpm): 250$ / 72 GB / 200 MB/s / 500 IOPS
The price of hard disks is said to be proportional to IOPS, not capacity.
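That claim can be checked against the slide’s own example numbers; in this sketch (ours), both drives come out at the same $/IOPS while differing ~70x in $/GB:

#include <stdio.h>

int main(void) {
    /* price ($), capacity (GB), IOPS -- figures from the slide */
    double commodity[3]  = {  50.0, 1000.0, 100.0 };
    double enterprise[3] = { 250.0,   72.0, 500.0 };

    printf("commodity : %.3f $/GB, %.2f $/IOPS\n",
           commodity[0] / commodity[1], commodity[0] / commodity[2]);   /* 0.050, 0.50 */
    printf("enterprise: %.3f $/GB, %.2f $/IOPS\n",
           enterprise[0] / enterprise[1], enterprise[0] / enterprise[2]); /* 3.472, 0.50 */
    return 0;
}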
39. Storage Device Metrics (2): HDD vs. Flash SSDs
Other metrics: weight, shock resistance, heat & cooling, power (watts), IOPS/watt, IOPS/$ ….
− Hard disk << flash SSD
[Source: “Rethinking Flash in the Data Center,” IEEE Micro 2010]
40. Evolution of DRAM, HDD, and SSD (1987 - 2017)
Raja et al., “The Five-minute Rule Thirty Years Later and its Impact on the Storage Hierarchy,” ADMS ’17
41. Technology RATIOS Matter
[Source: Jim Gray’s PPT]
Technology ratio change: 1980s vs. 2010s
If everything changes in the same way, then nothing really changes. [See next slide]
If some things get much cheaper/faster than others, then that is real change.
Some things are not changing much: e.g. the cost of people, the speed of light
And some things are changing a LOT: e.g. Moore’s law, disk capacity
Latency lags behind bandwidth
Bandwidth lags behind capacity
Flash memory/NVRAMs and their role in the memory hierarchy? A disruptive technology-ratio change -> a new disruptive solution!!
Disk improvements [‘88 – ’04] (see the compounding sketch below):
− Capacity: 60%/y
− Bandwidth: 40%/y
− Access time: 16%/y
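Compounding those annual rates over the 16 years from ’88 to ’04 (a quick sketch of ours) shows why the ratios diverge:

#include <math.h>
#include <stdio.h>

int main(void) {
    double years = 16.0;  /* '88 - '04; improvement factors below */
    printf("capacity   : x%.0f\n", pow(1.60, years));  /* ~x1,850 */
    printf("bandwidth  : x%.0f\n", pow(1.40, years));  /* ~x220   */
    printf("access time: x%.0f\n", pow(1.16, years));  /* ~x11    */
    return 0;
}

Capacity improved ~1,850x while access time improved only ~11x, so disks became effectively far slower relative to their size; that divergence is exactly the kind of disruptive ratio change named above.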
42. Latency Gap in Memory Hierarchy
Typical access latency & granularity
We need a `gap filler’!
[Source: Uwe Röhm’s slide]
Latency lags behind bandwidth [David Patterson, CACM, Oct. 2004]
The bandwidth problem can be cured with money, but the latency problem is harder
43. Our Message in the SIGMOD ’09 Paper
One flash SSD can beat ten hard disks in OLTP
- performance, price, capacity, power - (in 2008)
In 2015, one flash SSD can beat several tens of hard disks in OLTP
“My tests and those of several others suggest that FLASH disks can deliver about 3K random 8KB reads/second and with
some re-engineering about 1,100 random 8KB writes per second. Indeed, it appears that a single FLASH chip could deliver
nearly that performance and there are many chips inside the “box” – so the actual limit could be 4x or more. But, even the
current performance would be VERY attractive for many enterprise applications. For example, in the TPC-C benchmark, has
approximately equal reads and writes. Using the graphs above, and doing a weighted average of the 4-deep 8 KB random
read rate (2,804 IOps), and 4-deep 8 KB sequential write rate (1233 IOps) gives harmonic average of 1713 (1-deep gives
1,624 IOps). TPC-C systems are configured with ~50 disks per cpu. For example the most recent Dell TPC-C system has
ninety 15Krpm 36GB SCSI disks costing 45k$ (with 10k$ extra for maintenance that gets “discounted”). Those disks are
68% of the system cost. They deliver about 18,000 IO/s. That is comparable to the requests/second of ten FLASH disks. So
we could replace those 90 disks with ten NSSD if the data would fit on 320GB (it does not). That would save a lot of money
and a lot of power (1.3Kw of power and 1.3Kw of cooling).” (excerpts from “Flash disk opportunity for server-applications,” Jim Gray)
45. HDD vs. SSD [Patterson 2016]
46. Page Size
Default page size in Linux and Windows: 4 KB (sector: 512 B)
Oracle
− “Oracle recommends smaller Oracle Database block sizes (2 KB or 4 KB) for online transaction processing or mixed workload environments and larger block sizes (8 KB, 16 KB, or 32 KB) for decision support system workload environments.” (see the chapter “I/O Configuration and Design” in the Database Performance Tuning Guide)
47. Why a Smaller Page on SSDs?
Small page size advantage - better throughput
48. LinkBench: Page Size
Benefits of a small page
− Better read/write IOPS
Exploits the SSD’s internal parallelism
− Better buffer-pool hit ratio
Editor's Notes
Next, we measured the impact of page size on overall performance.
In this experiment, we disabled WRITE_BARRIER and the doublewrite buffer.
As you can see on the left, a smaller page delivers better performance with LinkBench.
A small page improves the effectiveness of data transfer and is likely to better leverage the internal parallelism of the SSD.
In addition, we also monitored the hit ratio of the buffer pool while varying the page size.
As you can see on the right, smaller pages deliver better hit rates, and the gap becomes wider with a larger buffer pool.
Again, smaller pages reduce the amount of unnecessary data in the buffer pool and utilize the bandwidth better, by reducing unnecessary data transfer between the host and the storage device.
----------------------------------------------------------
In this slide, I will explain the advantages of using a small page size by using performance results.
As you can see from the left graph, when the write-barrier option was off, the performance gain obtained by using a small page size was significant:
more than a two-fold increase in transaction throughput from reducing the page size from 16KB to 4KB.
This is because a flash memory SSD can exploit its internal parallelism maximally when the write barrier is off.
Another benefit of using a small page is an improved buffer pool cache effect.
The right graph shows the buffer pool hit ratio observed from running the LinkBench workload with different page sizes.
As the graph shows, the buffer hit ratio increased more quickly with a 4KB page size than with the others.
This higher hit ratio, in combination with the higher IOPS of a small page, contributes to the widening throughput gap among the three page sizes as the size of the buffer pool increases.
Due to limited time, we did not include the transaction throughput graph varying the buffer pool size.
Please refer to the paper.