The document discusses the evolution of scalable game server backends used by a company over 4 generations. The first backend used MySQL and had bottlenecks. The second used Redis but was still sensitive to latency. The third was stateful using Erlang and improved availability but had challenges. The fourth aims to improve on multiplayer, real-time updates, operations, and removing single points of failure while using hosted databases. Future backends may use Erlang for harder games with a potential new revolution in design.
The document discusses database cloning challenges and solutions for thin provision cloning using various technologies like Oracle CloneDB, EMC BCV, SRDF, VMware snapshots, ZFS, and NetApp FlexClone. It also covers database virtualization solutions like Oracle SMU and Delphix that provide self-service virtual databases to developers by sharing blocks between clones for faster provisioning and development cycles. Case studies described how virtualization can accelerate development by providing frequent fresh clones of the source database.
The document describes the structure and organization of the UNIX operating system. It discusses three major components: the shell, kernel, and hardware. The shell conveys commands from users to the kernel to execute. The kernel resides between the hardware and shell and manages processes, memory, I/O, and other low-level services. Utilities are standard programs that perform system functions like moving files, while applications provide extended capabilities and are created by users, administrators, and programmers rather than being part of the core UNIX system.
Some key value stores using log-structure - Zhichao Liang
These slides present three key-value stores that use a log structure: Riak, RethinkDB, and LevelDB. Note that the statement that RethinkDB employs an append-only B-tree is an estimate made by combining guessing with reasoning.
Data Consistency Enhancement in Writeback mode of Journaling using Backpointers - Rohan Waghere
This document presents a mechanism to enhance data consistency in the writeback mode of journaling file systems. The mechanism uses backpointers to provide a mutual agreement between data blocks and their parent inode, allowing consistency checks to be performed on-demand without ordering constraints. Backpointers containing the inode number are written with each data block. During reads, the backpointer is verified against the inode number to ensure logical identity. This provides consistency comparable to eager journaling modes like ordered mode, while retaining the performance advantages of writeback mode. Evaluation shows the backpointer approach has higher throughput than traditional journaling modes while enabling detection of inconsistencies.
Scylla Summit 2016: Compose on Containing the Database - ScyllaDB
This document discusses how Compose applies containerization best practices to provide database services. It outlines the "Twelve Factors of Stateful Apps" that guide Compose's architecture. These include running databases and data in separate containers, using environment variables for configuration, scaling containers vertically before adding nodes, and collecting logs and metrics within the deployment. By applying these factors, Compose can reliably deploy a range of database technologies like MongoDB, PostgreSQL, and now ScyllaDB across its platform.
This document discusses using logical volume management (LVM) with MySQL for enterprise storage. Some key points include:
- LVM allows virtualizing storage into logical volumes mapped to physical disks, enabling features like snapshots and clones for backups and replication.
- Snapshots provide point-in-time copies while clones are writable copies, both using copy-on-write to avoid duplicating data.
- Backing up MySQL with snapshots is fast, and recovery involves reverting the logical volumes rather than restoring files. Replication slaves can be quickly recloned from snapshots.
- With shared storage, replication slaves maintain the same blocks as the master for efficient scaling and resyncing slaves by recloning
SOUG PDB Security, Isolation and DB Nest 20c - Stefan Oehrli
Lockdown Profile, PDB_OS_CREDENTIALS and other measures to enhance the security and isolation of multitenant databases have been available since Oracle 12c. Unfortunately, only a part of the desired measures can be technically implemented. With the latest release, Oracle 20c, a new feature called DB Nest has been introduced. DB Nest takes another approach to security in PDBs. In this presentation we will discuss the new approach and its possibilities for increasing the database security of PDBs. The presentation will be completed by corresponding examples and live demos.
Lots of small objects in a swift cluster can lead to performance issues on the object servers. We propose a backend change to improve performance for this workload.
This document provides an overview of Oracle 12c Pluggable Databases (PDBs). Key points include:
- PDBs allow multiple databases to be consolidated within a single container database (CDB), providing benefits like faster provisioning and upgrades by doing them once per CDB.
- Each PDB acts as an independent database with its own data dictionary but shares resources like redo logs at the CDB level. PDBs can be unplugged from one CDB and plugged into another.
- Hands-on labs demonstrate how to create, open, clone, and migrate PDBs between CDBs. The document also compares characteristics of CDBs and PDBs and shows how a non-C
This document provides an overview of Oracle 9i Real Application Clusters (RAC) on Linux. It discusses the benefits of RAC such as scalability, high availability, and transparent expansion. Key components of RAC are described including cache fusion, global cache management, and resource coordination. Failure detection and recovery processes are also summarized. The document concludes with information on configuring Oracle 9i RAC and Linux kernel parameters on Linux systems.
Linux Symposium 2009 Slide Suzaki "Effect of readahead and file system block ..." - Kuniyasu Suzaki
Ext2/3optimizer and readahead optimization can improve the performance of LBCAS (LoopBack Content Addressable Storage). Ext2/3optimizer rearranges file system blocks based on access profiles to reduce the number of small readahead requests. This increases the readahead coverage size, reduces redundant block downloads, and improves disk access locality. Performance analysis shows these optimizations reduce I/O utilization and consumption time in LBCAS during system booting.
Machine Data to Readable Reports - System Monitoring, Alerting and Reporting ... - Blackboard APAC
Within a year, USC have enhanced various system administration tasks. From length file and database interrogation, we are now running with a proactive instant alerting process where incidents are captured and actioned before staff and students are impacted. A number of commercial, open-source and in-house tools have been utilised to facilitate these improvements and sights are now set on shifting to self-healing incidents.
Delivered at Innovate and Educate: Teaching and Learning Conference by Blackboard. 24 -27 August 2015 in Adelaide, Australia.
Database as a Service on the Oracle Database Appliance Platform - Maris Elsins
Speaker: Marc Fielding, Co-speaker: Maris Elsins.
Oracle Database Appliance provides a robust, highly-available, cost-effective, and surprisingly scalable platform for database as a service environment. By leveraging Oracle Enterprise Manager's self-service features, databases can be provisioned on a self-service basis to a cluster of Oracle Database Appliance machines. Discover how multiple ODA devices can be managed together to provide both high availability and incremental, cost-effective scalability. Hear real-world lessons learned from successful database consolidation implementations.
At StampedeCon 2012 in St. Louis, Pritam Damania presents: Reliable backup and recovery is one of the main requirements for any enterprise grade application. HBase has been very well embraced by enterprises needing random, real-time read/write access with huge volumes of data and ease of scalability. As such, they are looking for backup solutions that are reliable, easy to use, and can co-exist with existing infrastructure. HBase comes with several backup options but there is a clear need to improve the native export mechanisms. This talk will cover various options that are available out of the box, their drawbacks and what various companies are doing to make backup and recovery efficient. In particular it will cover what Facebook has done to improve performance of backup and recovery process with minimal impact to production cluster.
This document discusses caching strategies and techniques. It covers when and what to cache, including entire pages, page fragments, and data. It also discusses different caching mechanisms like file system, database, and in-memory caching and their pros and cons. It provides guidance on managing cache expiration policies and invalidating cached content.
An Efficient Backup and Replication of Storage - Takashi Hoshino
This document describes WalB, a Linux kernel device driver that provides efficient backup and replication of storage using block-level write-ahead logging (WAL). It has negligible performance overhead and avoids issues like fragmentation. WalB works by wrapping a block device and writing redo logs to a separate log device. It then extracts diffs for backup/replication. The document discusses WalB's architecture, algorithm, performance evaluation and future work.
Presentation data domain advanced features and functions - xKinAnx
This document provides an overview of Data Domain advanced features and functions for Velocity Partner Accreditation. It covers topics such as virtual tape library (VTL) planning, snapshots, replication, recovery, DD Boost integration, capacity and throughput planning, and system monitoring tools. The document contains lessons and explanations on these topics to help partners learn about and describe Data Domain's data protection solutions.
RMAN uses backups to clone databases, which takes time and storage space. Delphix clones databases virtually by linking to a source and sharing blocks, allowing near-instant clones that use minimal storage. The document compares RMAN and Delphix approaches to cloning databases for development environments.
"SQL Server Storage Configuration for SharePoint" presented to the Silicon Valley SQL Server User Group on January 13, 2010
Presenter: Burzin Patel, author and Solutions Architect at StorSimple
Learn about the Top Five SQL Server storage configuration best practices for SharePoint, including:
•Disk sizing and configuration •Externalizing BLOB storage •Common maintenance tasks •Performance tuning
Pyramid: A large-scale array-oriented active storage system - Viet-Trung TRAN
The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable to deal with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.
This document discusses database virtualization and instant cloning technologies. It begins by outlining the challenges businesses face with growing databases and increasing demands for copies from developers, reporting teams, etc. It then covers three main parts:
1) Cloning technologies including physical cloning, thin provision cloning using file system snapshots, and database virtualization.
2) How these technologies can accelerate businesses by enabling faster development, testing, recovery and reporting.
3) Specific use cases like development acceleration through frequent, full clones; branching for rapid QA; recovery and testing capabilities; and enabling fast data refreshes for reporting.
Veeam backup Oracle DB in a VM is easy and reliable way to protect data - Aleks Y
A few slides on why backing up an Oracle DB in a VM with Veeam B&R is an easy and reliable way to protect data: easy to back up, easy to recover and restore.
(see NOTES tab under presentation for more detail) An overview talk about decisions I've made so far in architecting the BHL clustered, distributed storage filesystem. Covering background on the proposal, proof of concept, to a current status report and thoughts of future implementations and uses.
The document provides an overview of storage options in Windows Azure, including Table storage, Blob storage, Queue storage, and Drive storage. It discusses the concepts and capabilities of each storage type such as structure, size limits, and access mechanisms. The document also includes demos of Azure storage features and resources for additional information.
20160922 Materials Data Facility TMS Webinar - Ben Blaiszik
Fall 2016 TMS Webinar on Data Curation Tools. Slides for the Materials Data Facility presentation on data services (publish and discover) as described by Ben Blaiszik. See http://www.materialsdatafacility.org for more information.
1. Consistency Without Ordering
Vijay Chidambaram, Tushar Sharma,
Andrea Arpaci‐Dusseau, Remzi Arpaci‐Dusseau
The Advanced Systems Laboratory
University of Wisconsin Madison
6. Ordering points require trust
• File system runs on stack of virtual devices
– Consistency fails if any device ignores commands to flush cache
F_FULLFSYNC: “…The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data.”
VirtualBox: “If desired, the virtual disk images can be flushed when the guest issues the IDE FLUSH CACHE command. Normally these requests are ignored for improved performance”
2/15/12 FAST 12 6
7. Is crash-consistency possible without ordering points?
• Middle ground between lazy and eager approaches
• Simplicity and high performance of lazy approach
• Strong consistency and availability of eager approach
8. Our solution: No-Order File System (NoFS)
Order-less file system which uses mutual agreement between objects to obtain consistency
9. Results
• Designed a new crash-consistency technique
– Backpointer-based consistency (BBC)
• Theoretically and experimentally verified that NoFS provides strong consistency
• Evaluated NoFS against ext2 and ext3
– NoFS performance comparable to ext2
– NoFS performance equal to or better than ext3
11. Crash consistency and object identity
All file system inconsistencies are due to ambiguity about the logical identity of an object
• Logical identity of an object
– Data block: Owner file, offset
– File: Parent directories
• Common inconsistencies
– Two files claim the same data block
– File points to garbage data
12. Crash Scenario
• Actions:
– File A is truncated
– The freed data block is allocated to File B
– The updated data blocks are written to disk
• Problem: Due to a crash, File A is not updated on disk
• Result: On disk, both files claim the data block
[Figure: MEMORY state (File A, File B, data block) compared with DISK state (File A, data block)]
15. Using backpointers in a crash scenario
• Actions:
– File A is truncated
– The freed data block is allocated to File B
– The updated data blocks are written to disk
• Problem: Due to a crash, File A is not updated on disk
• Result: Using the backpointer, the true owner is identified
[Figure: MEMORY state (File A, File B, data block) compared with DISK state (File A, data block)]
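The crash scenario above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the NoFS code: the `Inode` and `DataBlock` classes, the `owns` helper, and the block and inode numbers are hypothetical names chosen for illustration. The one idea taken from the slides is that a file's claim on a block counts only when the block's backpointer {inode num, offset} agrees with the file's forward pointer.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class DataBlock:
    # Backpointer stored with the block: (owner inode number, offset in file).
    backpointer: Optional[Tuple[int, int]] = None
    payload: bytes = b""


@dataclass
class Inode:
    number: int
    # Forward pointers: file offset -> block number.
    blocks: Dict[int, int] = field(default_factory=dict)


def owns(inode: Inode, offset: int, disk: Dict[int, DataBlock]) -> bool:
    """Return True only if the forward pointer and the backpointer agree."""
    block_no = inode.blocks.get(offset)
    if block_no is None or block_no not in disk:
        return False
    return disk[block_no].backpointer == (inode.number, offset)


def crash_scenario() -> Tuple[bool, bool]:
    # Hypothetical on-disk state after the crash: File A's truncate never
    # reached the disk, so A still points at block 7, while block 7 was
    # rewritten with File B's data and File B's backpointer.
    file_a = Inode(number=1, blocks={0: 7})
    file_b = Inode(number=2, blocks={0: 7})
    disk = {7: DataBlock(backpointer=(2, 0), payload=b"data of B")}
    return owns(file_a, 0, disk), owns(file_b, 0, disk)
```

Both inodes point at block 7, but only File B's claim is confirmed by the block itself, so `crash_scenario()` reports File A's pointer as stale and File B as the owner.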
24. Determining allocation information
[Figure: ext2 uses a data block bitmap; NoFS uses a data block bitmap plus a data block validity bitmap. Both panels show Files A to D and their data blocks.]
26. Design
[Figure: layout diagram. Memory: group descriptor, inode bitmap, data block bitmap, inode validity bitmap, data block validity bitmap. Disk: group descriptor, inode bitmap, data block bitmap, directory, file, data block.]
27. Implementation
• Based on ext2 codebase
• Three types of backpointers
– Data block backpointers {inode num, offset}
– Inode backlinks {inode num}
– Directory block backpointers {dot directory entry}
• Inode size increased to support 32 backlinks
• Modified the Linux page cache to add checks
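As a rough illustration of the inode backlinks listed above, the sketch below models a directory-to-file link as valid only when the file's inode carries a backlink to that directory. It is not the NoFS implementation: the `Inode` class, `add_backlink`, and `link_is_valid` are hypothetical names, and raising an error at the 33rd backlink is an assumption made here, since the slide states only that the inode holds 32 backlinks.

```python
from dataclasses import dataclass, field
from typing import List

MAX_BACKLINKS = 32  # from the slide: inode size increased to support 32 backlinks


@dataclass
class Inode:
    number: int
    # Inode numbers of the parent directories that link to this inode.
    backlinks: List[int] = field(default_factory=list)


def add_backlink(inode: Inode, parent_dir_ino: int) -> None:
    """Record that directory `parent_dir_ino` links to `inode`."""
    if len(inode.backlinks) >= MAX_BACKLINKS:
        # Assumption for this sketch: refuse further links once full.
        raise OverflowError("inode already holds the maximum number of backlinks")
    inode.backlinks.append(parent_dir_ino)


def link_is_valid(dir_ino: int, inode: Inode) -> bool:
    """A directory entry is trusted only if the inode points back at the directory."""
    return dir_ino in inode.backlinks
```

A directory entry left behind by a crash would fail `link_is_valid`, because the target inode never recorded that directory as a parent.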
29. Evaluation
• Q: Is NoFS robust against crashes?
– Fault injection testing
• Q: What is the overhead of NoFS?
– Evaluated on micro and macro benchmarks
• Q: How does the background scan affect performance?
– Measured write bandwidth, access latency during scan
30. Is NoFS robust against crashes?
Fault injection testing
• Interpose pseudo-device driver between the file system and disk
• Discard writes to selected sectors
• Emulate crash with different blocks successfully updated on disk
• 20 different crash scenarios
NoFS detected all inconsistencies
• Errors returned on invalid access
• Orphan inodes/blocks reclaimed
[Figure: writes from file system pass through the pseudo-device driver; only selected writes reach the disk]
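The fault injection setup can be pictured with a toy model of the interposed layer. This is a minimal sketch, assuming a dictionary stands in for the disk; `FaultInjectingDevice` and `run_crash_scenario` are hypothetical names and do not come from the talk. It shows only the mechanism the slide describes: writes to selected sectors are silently discarded, so the disk ends up with some blocks updated and others stale.

```python
from typing import Dict, Iterable, List, Set, Tuple


class FaultInjectingDevice:
    """Toy stand-in for a pseudo-device driver between file system and disk."""

    def __init__(self, disk: Dict[int, bytes], drop_sectors: Set[int]) -> None:
        self.disk = disk
        self.drop_sectors = drop_sectors
        self.dropped: List[int] = []  # sectors whose writes were discarded

    def write(self, sector: int, data: bytes) -> None:
        if sector in self.drop_sectors:
            # Emulate a crash before this write reached stable storage.
            self.dropped.append(sector)
            return
        self.disk[sector] = data

    def read(self, sector: int) -> bytes:
        return self.disk[sector]


def run_crash_scenario(
    writes: Iterable[Tuple[int, bytes]], drop_sectors: Set[int]
) -> Dict[int, bytes]:
    """Apply writes through the faulty device and return the resulting disk."""
    disk: Dict[int, bytes] = {}
    device = FaultInjectingDevice(disk, drop_sectors)
    for sector, data in writes:
        device.write(sector, data)
    return disk
```

Varying `drop_sectors` over the blocks touched by one file system operation yields different post-crash disk images, each of which can then be mounted and checked.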
31. What is the overhead of NoFS?
Performance in micro and macro benchmarks
[Figure: bar chart of throughput normalized to ext2 (0 to 1) for ext2, NoFS, and ext3 on four workloads: SeqWrite (writes to 1 GB file), RandWrite (4088 bytes per write to 1 GB file), File Create (100K files over 100 directories with fsync), Varmail (Filebench)]
• NoFS performance comparable to ext2
• NoFS performance is better than ext3 for sync heavy workloads
32. How does the background scan affect performance?
• Scan reads are interleaved with file system I/O
• Access to objects not verified by scan incurs a performance penalty
35. Scan reads are interleaved with file system I/O
[Figure: write bandwidth obtained (MB/s, 0 to 70) versus time (0 to 360 s), with scan completion marked]
I/O bandwidth is reduced during scan, but peak performance achieved on scan completion
36. Access to objects not verified by scan costs more
• The stat problem
– stat returns number of blocks allocated
– This information might be stale for un-verified inode
– NoFS verifies the inode upon stat
• Involves checking each inode data block
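The stat behaviour described above can be sketched as lazy verification: the first stat on an un-verified inode checks every data block it points to, and later calls reuse the result. The code is illustrative only. The `Inode` class, the `verified` flag, and `stat_block_count` are hypothetical names, the disk is modelled as a map from block number to backpointer, and dropping stale pointers during verification is an assumption of this sketch rather than something the slide states.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Block number -> backpointer (owner inode number, offset) stored with the block.
Disk = Dict[int, Tuple[int, int]]


@dataclass
class Inode:
    number: int
    blocks: Dict[int, int] = field(default_factory=dict)  # offset -> block number
    verified: bool = False


def stat_block_count(inode: Inode, disk: Disk) -> int:
    """Return the number of allocated blocks, verifying the inode on first use."""
    if not inode.verified:
        # One-time cost: read every data block and compare its backpointer.
        inode.blocks = {
            offset: block_no
            for offset, block_no in inode.blocks.items()
            if disk.get(block_no) == (inode.number, offset)
        }
        inode.verified = True
    return len(inode.blocks)
```

The first call is proportional to the file's size in blocks, which matches the slide's point that verification involves checking each inode data block; once `verified` is set, the call is a simple count.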
37. Access to objects not verified by scan costs more
• Experiment
– Create a number of directories with 128 files (each 1 MB)
– At each 50 second interval, starting from file-system mount
• Run ls -l on directory
• This causes a stat call on every inode
• stat on un-verified inodes requires reading all its data
– Measure time taken
38. Access to objects not verified by scan costs more
[Figure: time taken for ls -l (s, 0 to 18) versus time after file-system mount (0 to 600 s), with scan completion marked]
There is a performance cost to accessing un-verified objects during the scan
One time cost, only until scan completion
40. Summary
• Problem: Providing crash-consistency and high availability without ordering points
• Solution: NoFS with Backpointer-based consistency
– Use mutual agreement to drive consistency
• Advantages:
– Strong consistency guarantees
– Performance similar to order-less file system
44. Running time of scan
[Figure: scan running time (s, 0 to 160) versus total data in the file system (1 to 1024 MB)]
45. Performance cost of stat on unverified inodes
[Figure: time for ls system call (s, 0 to 60) versus time (s), for total data of 128 MB, 256 MB, and 512 MB, with scan completion marked]
46. Effect of background scan on write bandwidth
[Figure: write bandwidth (MB/s, 0 to 80) versus time (0 to 330 s) with a background scan every 30 seconds; two series: writes starting at 0 s and writes starting at 20 s]
47. Performance of data block scan
[Figure: time taken (s, 0.01 to 100) versus total data scanned (1 to 1000 MB), both axes logarithmic]
49. Use cases
• NoFS provides crash-consistency without ordering
• BBC can be used in conventional file systems to ensure runtime integrity
• NoFS can be used as local file system in GFS, HDFS
• NoFS allows virtual machines to maintain consistency without trusting lower-layer primitives