A White paper by CopperEye Ltd March, 2010
Flash disk has now become a commonplace replacement for conventional hard disk storage.
The vendors of these products have invested significant time and effort in providing a plug and
play replacement for standard disks by emulating the hard disk drive interfaces. While the
interfaces are the same, flash disk performance is expected to be radically superior to hard disk
technology and yet can be disappointingly similar to hard disks for certain workloads.
This paper discusses the characteristics of flash technology and explores how CopperEye
technology can yield the best performance for indexes hosted on flash storage.
Flash Memory Performance
Vendors of flash storage regularly claim random I/O rates of hundreds of thousands of
operations per second. Compared to a fast hard disk drive - which can achieve about 200
random seeks per second - it would appear that flash disk is around three orders of magnitude
faster than hard disk. Therefore, it might be imagined that plugging in a flash drive as a
replacement for a hard drive would yield a performance boost of at least a hundred fold.
What vendors make less obvious is that flash memory is extremely asymmetrical in its behavior
- a random write can take much longer than a random read of the same size. Random 4KB
write rates are currently only about twice those of a good hard disk, whereas random read rates
are typically 500 times faster. In other words, random reads are blazingly fast while random
writes are quite the opposite. The same holds for sequential access IO - the read transfer
rates for flash disk are excellent while write transfer rates are closer to those found with a good
hard disk.
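The effect of this asymmetry on a mixed workload can be estimated with a little arithmetic. The following is a minimal sketch, assuming operations are serviced one at a time and using only the round figures quoted above (200 random operations per second for a hard disk, reads 500 times faster and writes twice as fast on flash). The function names are hypothetical and real devices will differ.

```python
# Illustrative figures taken from the text; real devices vary.
HDD_RANDOM_OPS = 200                     # random seeks per second for a fast hard disk
FLASH_READ_OPS = 500 * HDD_RANDOM_OPS    # random reads "typically 500 times faster"
FLASH_WRITE_OPS = 2 * HDD_RANDOM_OPS     # random 4KB writes only about twice a hard disk


def effective_ops_per_second(write_fraction, read_ops, write_ops):
    """Average operations per second for a serial mix of reads and writes."""
    read_fraction = 1.0 - write_fraction
    # Average time per operation is the weighted sum of the two service times.
    seconds_per_op = read_fraction / read_ops + write_fraction / write_ops
    return 1.0 / seconds_per_op


def flash_speedup(write_fraction):
    """Speed-up of flash over hard disk for the given fraction of writes."""
    flash = effective_ops_per_second(write_fraction, FLASH_READ_OPS, FLASH_WRITE_OPS)
    disk = effective_ops_per_second(write_fraction, HDD_RANDOM_OPS, HDD_RANDOM_OPS)
    return flash / disk
```

Under these assumptions a read-only workload gains the full factor of 500, a write-only workload gains a factor of 2, and an even mix of random reads and writes gains a little under a factor of 4, because the slow writes dominate the elapsed time.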
Flash Memory Structure
To understand why a flash disk is so skewed in its read/write characteristics, it is necessary to
look at the structure of flash memory and the limitations that arise.
NAND flash memory is the most common replacement for disk storage because of its faster
write time, greater density and lower cost than the alternatives. This type of flash memory is
composed of blocks which are divided into pages. The sizes used for blocks and pages vary
depending on the overall size of the chip, but a page size of 4KB and a block size of 2MB are
typical in today’s larger chips.
When reading from the chip, the read can be rapidly performed at the page level (4KB) and
each page can be randomly and individually addressed in a matter of microseconds which
yields extremely fast random read rates.
But writes are much more complicated for NAND memory. Firstly, an individual bit can only be
updated in one direction (from 1 to 0) and bits must be updated in sequential order within a
page – if you wish to alter a bit in the opposite direction (from 0 to 1), the whole page must be
erased and re-written. Worse still, erasure operations can only occur at the block level and these
take a few milliseconds. Therefore, updating a single bit within a page requires another
block to be erased and all unchanged pages in the original block to be copied into the newly
erased block, with the single updated page being modified during its copy.
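The update cycle described above can be sketched as a toy model. This is an illustration only, not how any particular flash controller is implemented: the class and function names are hypothetical, the block and page sizes are shrunk to a few bytes, and the rule about sequential bit order within a page is not modelled.

```python
PAGES_PER_BLOCK = 4   # tiny for illustration; the text cites 2MB blocks of 4KB pages
PAGE_BYTES = 2        # likewise tiny


class FlashBlock:
    """One erase block; erased pages read as all 1 bits (0xFF bytes)."""

    def __init__(self):
        self.erase_count = 0
        self.pages = [bytes([0xFF] * PAGE_BYTES) for _ in range(PAGES_PER_BLOCK)]

    def erase(self):
        """Reset every bit in the whole block to 1; this is the slow operation."""
        self.pages = [bytes([0xFF] * PAGE_BYTES) for _ in range(PAGES_PER_BLOCK)]
        self.erase_count += 1

    def program(self, page_no, data):
        """Program a page; bits may only move from 1 to 0."""
        old = self.pages[page_no]
        # A bit that is 1 in the new data but 0 in the current page is illegal.
        if any(new & ~cur & 0xFF for cur, new in zip(old, data)):
            raise ValueError("cannot set a bit from 0 to 1 without erasing")
        self.pages[page_no] = bytes(data)


def update_page(source, spare, page_no, data):
    """Rewrite one page by erasing a spare block and copying the rest across."""
    spare.erase()
    for i in range(PAGES_PER_BLOCK):
        spare.program(i, data if i == page_no else source.pages[i])
    return spare
```

In this model a change to one page costs a full block erasure plus a copy of every other page in the block, which is where the cost of a small write comes from.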
This is why a small random write is so much slower than a small random read and is closer in
performance to a hard disk. Indeed, even a fast erasure time of 2ms limits the chip to a
write rate of no more than 500 write operations per second. Even if those write operations are
large and transfer a whole block (say 2MB) per write, the theoretical sequential write
transfer rate is at best 1GB/s, while experience shows that practical rates are actually much less.
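The erase-limited bounds quoted here follow from simple arithmetic, sketched below. The function names are hypothetical, decimal units are assumed so that the figures match the text, and one block erasure per write operation is assumed as in the simple model above.

```python
MB = 10**6   # decimal megabyte, assumed so that 500 x 2MB gives the 1GB/s in the text
KB = 10**3


def max_writes_per_second(erase_ms):
    """Upper bound on write operations if each one waits for one block erasure."""
    return 1000.0 / erase_ms


def max_write_bytes_per_second(erase_ms, bytes_per_write):
    """Upper bound on payload written per second at that erase-limited rate."""
    return max_writes_per_second(erase_ms) * bytes_per_write
```

With a 2ms erasure time this gives 500 writes per second. At a full 2MB block per write that is 1GB/s of payload, but at 4KB per write the same 500 operations move only 2MB/s.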
As an aside, it is interesting to note that a new flash drive will initially perform extremely well for
random writes - because it has a stock of empty blocks that do not require erasure prior to the
copy of an updated page. But over time the stock of empty blocks becomes exhausted because
every block used subsequently requires erasure before it can be used again.
As already mentioned, page and block sizes are linked to chip size with larger block sizes found
in larger chips and therefore it can be expected that performance characteristics will become
even more asymmetrical as chips grow ever larger.
Random I/O Problem
We can see that any storage structure that exerts a workload of small random writes against a
flash disk will suffer poor performance when compared to a structure that only requires large
write operations. Indeed, optimal performance can only be achieved when the average write
operation size is a multiple of the flash memory block size.
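To see why block-sized writes are the favourable case, the sketch below counts how many flash blocks a write touches and how many bytes must be physically rewritten per byte requested. It assumes the simple model described earlier, in which every touched block is rewritten in full, and a 2MB block size. The function names are hypothetical and a real drive's controller may behave differently.

```python
BLOCK_BYTES = 2 * 1024 * 1024   # assumed flash block size (2MB)


def blocks_touched(offset, length):
    """Number of flash blocks overlapped by a write of `length` bytes at `offset`."""
    first = offset // BLOCK_BYTES
    last = (offset + length - 1) // BLOCK_BYTES
    return last - first + 1


def write_amplification(offset, length):
    """Bytes physically rewritten per byte requested when whole blocks are rewritten."""
    return blocks_touched(offset, length) * BLOCK_BYTES / length
```

A block-aligned write of one or more whole blocks has an amplification of 1, whereas a single 4KB write inside a 2MB block has an amplification of 512, and a 4KB write that straddles a block boundary touches two blocks.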
For example, hosting a B-Tree index on flash disk would likely improve its update
performance only two-fold when compared to a hard disk drive. This is a lot less than might be
expected from flash storage at first glance and arises because each random key insert, update
or delete operation emerges as one or more small independent random write requests.
Of course, it is possible to cache the index structure in main memory to accumulate the random
writes and then flush the index to disk as large sequential writes. But caching requires a huge
amount of memory for a large index; and besides, this caching strategy is equally applicable to
a hard disk.
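The caching strategy just described can be sketched as a small write buffer. This is a generic illustration, not CopperEye's method: the class name, the `sink` callback and the flush threshold are all hypothetical.

```python
class CoalescingWriteBuffer:
    """Accumulates small random writes in memory and flushes them as one sorted batch."""

    def __init__(self, flush_bytes, sink):
        self.flush_bytes = flush_bytes   # flush once this much data is pending
        self.sink = sink                 # callable given a sorted list of (offset, data)
        self.pending = {}
        self.pending_bytes = 0

    def write(self, offset, data):
        """Buffer a write; a later write to the same offset replaces the earlier one."""
        previous = self.pending.get(offset)
        if previous is not None:
            self.pending_bytes -= len(previous)
        self.pending[offset] = data
        self.pending_bytes += len(data)
        if self.pending_bytes >= self.flush_bytes:
            self.flush()

    def flush(self):
        """Hand all pending writes to the sink in ascending offset order."""
        if self.pending:
            self.sink(sorted(self.pending.items()))
            self.pending = {}
            self.pending_bytes = 0
```

Nothing in this buffer depends on the device underneath, and the memory it needs grows with the amount of data held back, which is why the approach helps a hard disk just as much as a flash disk.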
In other words, flash memory cannot offer a significant performance benefit over hard disk for
any structure which naturally requires small random write IO operations to update it, because
whatever strategy is employed to coalesce write operations is equally applicable to hard disk.
The CopperEye Index Solution
The CopperEye index technology already eliminates small random write IO for index
maintenance and is as applicable to flash disk as it is to hard disk.
Indeed, CopperEye index technology is potentially even better suited to flash disk than hard
disk because it can benefit from the fast random read rate of flash disk to yield extremely fast
queries.
A CopperEye index is block structured and, unlike a B-Tree, can support extremely large block
sizes. This allows a CopperEye index to work with block sizes that are a multiple of the flash
memory block size. In fact, the characteristics of the technology imply that an optimal CopperEye
index block size would be eight times that of the flash memory block, which could yield an index
performance of the order of hundreds of thousands of key inserts per second per flash disk -
about 200 times better than an equivalent B-Tree index on the same flash disk. Queries would
be extremely fast too and could be many hundreds of times faster than those experienced with a
conventional hard disk.