Storage is changing. We need new algorithms to deal with it.

We are witnessing at least two revolutions in storage: (1) massive datasets and workloads, and (2) the rise of scale-out commodity hardware. This whitepaper describes the Acunu Data Platform, and how Acunu is allowing massive data workloads to take full advantage of today’s hardware.

Acunu is rewriting the storage stack in the Linux kernel for Massive Data thanks to world-class engineering and algorithms research.

Massive Data Workloads.

How have workloads changed? The workloads that massive datasets demand of hardware typically exhibit three main features:

• Continuously high ingest rates (many thousands of updates/s, typically high-entropy, random updates)
• Individual pieces of data are small, and aren’t valuable in isolation (for example, stock ticks or session IDs)
• Continual range queries are important for analytics (such as those demanded by Apache Hadoop)

This is in stark contrast to the ‘load, then query’ regimes of more traditional databases.

Understanding massive data means being able to extract features and trends, all while the data is continually updated. Existing platforms and solutions cannot do this at scale, with predictably high performance. This is where Acunu comes in.

The first revolution is the rise of non-relational, or ‘NoSQL’, databases such as Cassandra, and analytics frameworks and tools such as Hadoop. The driving force is using clusters of commodity machines to ingest large volumes of data, process it, and serve it. Previous technologies such as MySQL are traditionally cumbersome to operate at the scales needed here. For many deployments in both enterprise and non-enterprise settings, these newer technologies are likely to account for the majority of data stored where features such as high availability at low cost are more important than transactional durability.

The second revolution is a hardware one.
Commodity machines now typically possess many cores, and bear a closer resemblance to a supercomputer of the 90s than to a desktop of the same era. Hard drive capacity and sequential bandwidth have been doubling every 18 months, as predicted; yet random IO performance has not improved. Solid-state drives (SSDs) offer 2-3 orders of magnitude better random IO performance than hard drives. Clearly these have huge potential to revolutionize the database world, if only the software stack can harness their performance.

Fundamental research = new possibilities.

The Acunu Storage Core is based on fundamental, patent-pending algorithms and engineering research. This isn’t just a better implementation of an existing idea, or about a shinier UI or management console (although our management stack is also pretty cool). We are doing world-class research, engineering and patenting, and we publish at top conferences. Why? This allows us to do things that were simply not possible before. Here are some examples.

Fast, full versioning.

Versioning of large data sets is an incredibly powerful tool. Not just low-performance snapshots for backups, but high-performance, concurrently accessible clones and snapshots of live datasets for test and development, offering many users different, writeable views of the same large dataset, going back in time, and much more.

Traditionally, the state of the art in algorithms for versioning large data sets is based on a data structure known as the ‘copy-on-write B-tree’ (CoW B-tree). This is ubiquitous in file systems and databases including ZFS, WAFL, Btrfs, and more. The CoW B-tree (and most of its variants, such as append-only trees, log file systems, redirect-on-write, etc.) has three fundamental problems: (1) it is space-inefficient (and thus requires frequent garbage collection); (2) it relies on random IO to scale (and thus performs poorly on rotational drives); and (3) it cannot perform fast updates, even on SSDs.

Acunu has invented a fundamentally new data structure, the Stratified B-tree, that addresses all the above problems. Some details of this revolutionary data structure have been published: see [Twigg, Byde - Stratified B-trees and versioned dictionaries, USENIX HotStorage’11].

Designed for SSDs

Existing storage schemes do not address the fact that SSDs require addressing in a fundamentally different way. Although they present a SATA/SAS interface and are sector-addressed, this is only to allow them to be a drop-in replacement for hard drives.
Extracting maximum performance and lifetime requires two things: (1) a storage stack that understands how they operate; and (2) new data structures and algorithms that exploit their design characteristics.

By understanding how SSDs fundamentally work, Acunu has been able to engineer data structures that allow unprecedented long-term write performance, while guaranteeing device endurance.

Not just peak performance, but predictable performance.

By eliminating JVM-based garbage collection and memory management issues, and carefully controlling hardware access from within the Linux kernel, Acunu is able to offer predictably high performance, even under sustained high loads, with both ingest and analytic range queries: the perfect ingredients for any real-time analytics platform. Watch carefully in future versions as Acunu begins to deploy fundamentally new offerings here, exploiting our back-end algorithmic advantage.

SSDs - it’s all about endurance.

Flash SSDs are a fundamental change in storage technology, yet many systems treat them as if they were rotating hard drives. Indeed, the legacy storage stack is filled with implicit assumptions about rotational drives. To exploit SSDs fully, we need new algorithms and a stack that understands how flash SSDs fundamentally work.

What’s the problem?

Let’s start by considering why in-place updates to B-trees fail to give good performance on SSDs. The figure below shows what happens to a fresh Intel X25M Flash SSD [1] under a simple workload: repeatedly write a random 512KB buffer to a random 512KB-aligned offset. The device’s stated capacity is 160GB, and once the total volume written approaches this point the performance drops off dramatically. The take-away message is this: to get consistently high performance from this device, we need to do something else. B-trees, or any other random-write-intensive data structure, won’t work.

The reason for the drop-off once the write volume reaches the device capacity is quite complex, and depends on the internal structure of the device. If you’re interested, read this great report [2] for a simulation-based analysis of different SSD architectures. The basic reason is that although the flash memory chips have a 512KB erase block, most SSDs implement an internal log structure (the magic ‘flash translation layer’ or FTL) for several reasons, most notably because the bandwidth of the individual memory chips is relatively low, and to enable wear leveling and error correction. This often makes the effective logical erase block size much larger, typically hundreds of MB for recent MLC devices. The result is that writes are at the mercy of the device’s FTL, which is the part manufacturers keep quiet and closed.

Log file systems.

Many emerging file systems and storage products argue that append-only B-trees are perfectly suited to today’s hardware, particularly SSDs. Is this true?
The append-only B-tree has two major problems, which Acunu’s fundamental algorithms research finally overcomes.

The CoW B-tree has a potentially big space blowup: to rewrite a 16-byte key/value pair in a tree of depth 3 with 256K block size, you may have to do 3x256K random reads and then write 768K of data. In practice, some of these nodes are cached and don’t need rewriting, but for random updates to large datasets, the worst case is pretty close to what happens. Even if you don’t care about space utilisation, when the device is full, you’ll be writing, on average, a lot of data per small random update, and this means you’re no longer fast at writing. Unfortunately, other than heuristic tweaks or giving your machine gigantic amounts of RAM, this is an inherent problem for append-only CoW indexes.

The classic Achilles heel of a log file system is garbage collection (cleaning): recovering invalidated (e.g. overwritten) blocks in order to reclaim sufficiently large contiguous regions of free space so that future writes can be efficient. Very few guarantees are known for garbage collection in log file systems, particularly when the system does not experience idle time, or is under low free space conditions. To make matters worse, the space blowup described above means that
CoW trees generate a lot of extra work for the garbage collector: at a 50x space blowup, the garbage collector has to work 50x harder to keep ahead of the input stream.

Soules et al. (2003) [3] compare the metadata efficiency of a versioning file system using both CoW B-trees and a structure (CVFS) based on the Multi-version B-tree (MVBT) [4]. They find that, in many cases, the size of the CoW metadata index exceeds the dataset size. In one trace, the versioned data occupies 123GB, yet the CoW metadata requires 152GB while the CVFS metadata requires 4GB, a saving of 97%.

Stratified B-trees.

Acunu has invented a fundamentally new data structure, the Stratified B-tree [5,6], that dominates CoW B-trees, with or without log file systems. Stratified B-trees can be written without append-only logs and heuristic-based garbage collectors. They are the first data structure to offer provably optimal performance for full versioning (allowing updates in far less than 1 IO per update on average), use asymptotically optimal O(N) space, offer an optimal range of trade-offs between updates and queries, and can generally avoid performing random IO for both updates and range queries. In particular, one construction offers updates three orders of magnitude faster than CoW B-trees, and can answer range queries around one order of magnitude faster than the CoW B-tree!

[1] Model Number: INTEL SSDSA2M160G2GC, Firmware Revision: 2CV102M; writes use Linux AIO direct to device with queue depth 32.
[2] http://research.microsoft.com/apps/pubs/?id=63596
[3] http://www.hpl.hp.com/personal/Craig_Soules/papers/fast03.pdf
[4] http://portal.acm.org/citation.cfm?id=765851.765854
[5] A Twigg et al., Stratified B-trees and versioned dictionaries, USENIX HotStorage’11, 2011.
[6] A Byde, A Twigg, Stratified B-trees and versioned dictionaries (version with proofs), arXiv.org, 2011.
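
The rewrite cost quoted for CoW B-trees under ‘Log file systems’ can be reproduced with a little arithmetic. The Python sketch below is an illustration only, not Acunu code: the function name cow_update_cost and its parameters (including cached_levels) are hypothetical, and it assumes that every uncached level on the root-to-leaf path costs one full block read and one full block write.

```python
KIB = 1024


def cow_update_cost(depth, block_size, record_size, cached_levels=0):
    """Estimate the I/O for one copy-on-write update (illustrative model).

    Assumption: each uncached level on the root-to-leaf path is read once
    and rewritten once as a whole block. cached_levels is a hypothetical
    knob for levels that, as the text puts it, are cached and need no
    rewriting.
    """
    if not 0 <= cached_levels <= depth:
        raise ValueError("cached_levels must be between 0 and depth")
    touched = depth - cached_levels
    bytes_read = touched * block_size
    bytes_written = touched * block_size
    return {
        "bytes_read": bytes_read,
        "bytes_written": bytes_written,
        # Bytes written to the device per byte of user data updated.
        "write_amplification": bytes_written / record_size,
    }


# The example from the text: 16-byte pair, depth 3, 256K blocks, nothing cached.
cost = cow_update_cost(depth=3, block_size=256 * KIB, record_size=16)
print(cost["bytes_written"] // KIB)  # 768 (KiB written)
print(cost["write_amplification"])   # 49152.0
```

Under these assumptions a single 16-byte update writes 768K, tens of thousands of times the size of the record, which is why random updates to a large, mostly uncached dataset stop being fast.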

About Acunu.

Acunu is reengineering the storage stack from the ground up for the age of Massive Data. Based on fundamental algorithms research and world-class engineering, the Acunu Platform allows applications such as Apache Cassandra and Hadoop, along with many others, to (1) drive today’s commodity hardware harder than ever before, including many-core architectures, SSDs and large SATA drives; (2) exploit new features in the Acunu Core (such as fast cloning and versioning); and (3) obtain predictable, reliable high performance. Storage is the key to understanding Massive Data, and gaining competitive advantage. The Acunu Open Platform lets companies do this quicker, easier and cheaper.

Acunu was founded in 2009 by researchers and engineers from Cambridge, Oxford, and several well-known high-tech companies. We are backed by some of Europe’s top VCs, with total funding over $5.0M. We are based in London and California.

Founders.

Dr Tim Moreton, CEO: Tim is an expert in distributed file systems. He holds a PhD from Cambridge, where he built a distributed file system for the Xen project. He was previously at Tideway (now BMC), where he was lead engineer on a number of data center projects.

Dr Andy Twigg, CTO: Andy has an outstanding track record of theoretical and applied computing research. He has held positions at Cambridge University, Microsoft Research, Thomson Research and Oxford University. His PhD in 2006 on compact routing algorithms was nominated for the BCS Best Dissertation Award. He holds a Junior Research Fellowship at Oxford University, where he is a member of the CS department.

Tom Wilkie, VP Engineering: Tom was one of the first UK employees at XenSource before its acquisition by Citrix in 2007. He worked on the XenCenter management stack and numerous customer projects. He has a BA in Computer Science from Cambridge.

Dr John Wilkes, Technical Advisor: John is an advisor to Acunu.
John led the Storage Systems group at HP Labs for 15 years, before moving to Google in 2008. John received his PhD from Cambridge in 1984, an Outstanding Contribution award from SNIA in 2001 and was made an ACM Fellow in 2002.