Acunu: Understanding Massive Data.

We are witnessing at least two revolutions in storage: (1) massive datasets and workloads, and (2) the rise of scale-out commodity hardware. This whitepaper describes the Acunu Data Platform, and how Acunu is allowing massive data workloads to take full advantage of today’s hardware.

Acunu is rewriting the storage stack in the Linux kernel for Massive Data, thanks to world-class engineering and algorithms research.

Massive Data Workloads.

How have workloads changed? The workloads that massive datasets demand of hardware typically exhibit three main features:
• Continuously high ingest rates (many thousands of updates/s, typically high-entropy, random updates)
• Individual pieces of data are small, and aren’t valuable in isolation (for example, stock ticks or session IDs)
• Continual range queries are important for analytics (such as those demanded by Apache Hadoop)

This is in stark contrast to the ‘load, then query’ regimes of more traditional databases.

Understanding massive data means being able to extract features and trends while the data is continually updated. Existing platforms and solutions cannot do this at scale, with predictably high performance. This is where Acunu comes in.

The first revolution is the rise of non-relational, or ‘NoSQL’, databases such as Cassandra, and analytics frameworks and tools such as Hadoop. The driving force is using clusters of commodity machines to ingest large volumes of data, process it, and serve it. Previous technologies such as MySQL are cumbersome to operate at the scales needed here. For many deployments in both enterprise and non-enterprise settings, these technologies are likely to account for the majority of data stored where features such as high availability at low cost are more important than transactional durability.

The second revolution is a hardware one.
Commodity machines now typically possess many cores, and bear closer resemblance to a supercomputer of the 90s than to a desktop of the same era. Hard drive capacity and sequential bandwidth have been doubling every 18 months, as predicted; yet random IO performance has not improved. Solid-state drives (SSDs) offer 2-3 orders of magnitude better random IO performance than hard drives. Clearly these have huge potential to revolutionize the database world, if only the software stack can harness and utilize their performance.
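To put the 18-month doubling figure in perspective, the compounding can be worked out directly. The following is only a back-of-the-envelope sketch in Python: the function name `growth_factor` is ours, and the assumption that the doubling period stays constant over the whole interval is a simplification.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Multiplicative growth after `years`, assuming one doubling every
    `doubling_period_years` (18 months by default, as quoted in the text)."""
    return 2 ** (years / doubling_period_years)


# Capacity and sequential bandwidth compound; random IO is assumed flat,
# so the gap between the two widens by the same factor.
for years in (3, 6, 10):
    print(f"after {years} years: about {growth_factor(years):.0f}x")
```

Under these assumptions, a decade of doubling yields roughly a hundredfold increase in capacity with no matching gain in random IO, which is the gap the paragraph above points at.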
Acunu’s proposition - reengineering the stack for massive data.

These two revolutions expose a new problem. The “storage stack”, which abstracts away details of the hardware and allows applications to communicate with it, is now a serious bottleneck. It was built for the needs of databases and hardware of the 90s. The result is twofold: first, it presents fundamentally the wrong abstraction for Massive Data applications, which developers either work around or accept; second, it simply cannot be easily modified to take advantage of new storage technologies - the assumptions underlying rotational drives are implicit throughout it.

Acunu is taking the difficult, but fundamental, step of reengineering the storage stack for the age of Massive Data. We’ve thrown almost 30 world-class engineers, including over 10 PhDs, mathematicians, and Cambridge, Oxford and Stanford academics, at the problem. The result is a set of core components, rearchitected from the ground up.

Why is this important? “It’s disruptive if it’s a 10x benefit, because that’s a platform for creating opportunities for new ecosystems.” - Reid Hoffman, Data as Web 3.0 (SXSW 2011)

By revisiting the core storage stack, Acunu is able to provide a platform for Massive Data applications. This allows us to do things such as improve Apache Cassandra performance by almost 100x for heavy workloads, give it predictable performance (removing memory garbage collection problems), support SSDs with high write rates and guaranteed endurance, let multiple Massive Data stores interoperate (do you want to ingest via memcached and analyze via Cassandra?), offer fundamentally new features (such as full versioning - snapshots and clones - while doing fast inserts) via patented algorithms, and lots more.

We don’t need yet another database - we need a firm foundation on which to understand massive data.
Acunu Data Platform.

The Acunu Data Platform is a powerful storage solution that brings simpler, faster and more predictable performance to NoSQL stores like Apache Cassandra.

Our view is that the new data-intensive workloads that are increasingly common are a poor match for the legacy storage systems they tend to run on. These systems are built on a set of assumptions about the capacity and performance of hardware that are simply no longer true. The Acunu Data Platform is the result of a radical rethink of those assumptions; the result is high performance from low-cost commodity hardware.

Open Storage Core.

The Acunu Storage Core is an open-source, in-kernel, industrial-strength, write-optimized, multi-dimensional, fully-versioned, key-value store. It contains the majority of our techniques that provide extremely high, predictable performance. It is open-source under GPLv2, and can be downloaded for free from www.acunu.com.

Interoperability of multiple data stores.

Because they run on the Acunu Data Platform, multiple data stores can interoperate. For example, applications can write to the store using memcached (running on Acunu), and then perform analysis on the same data using Apache Cassandra, or the Hadoop framework (running on Acunu). Using Acunu’s versioning and advanced isolation tools, views of large data sets can be updated atomically and isolated from one another.

Powered by Acunu.

We provide user-level client libraries to allow applications to run on the Storage Core. Typically, a small patch or plugin is required in order for the application to use the Acunu client libraries. Version 1 ships with the Acunu Distribution for Apache Cassandra, and a large object store that talks the same protocol as Amazon’s S3 store, based on Project Voldemort. As time goes on, we will release more patches, and we will look to the community to contribute patches for various projects. We will make all these open and freely available.
Monitor and control the entire stack, over the whole cluster.

To make all of this easier to use, we have also produced some snazzy management tools. These are web-based and follow the same decentralized model of Cassandra: simply point your web browser at any of the boxes running Acunu’s software and you will be able to create a cluster, do snapshots and clones, or see what is happening across your Acunu storage nodes.

Since the Acunu platform replaces the file system and page cache, it has direct hardware access and unprecedented hardware visibility. This means that Acunu’s monitoring tools can observe and directly control such things as disk queues, latencies throughout the stack, and much more. One can quickly diagnose hardware bottlenecks, and inefficiencies up and down the stack, across the entire cluster.
Fundamental research = new possibilities.

The Acunu Storage Core is based on fundamental, patent-pending algorithms and engineering research. This isn’t just a better implementation of an existing idea, or about a shinier UI or management console (although our management stack is also pretty cool). We are doing world-class research, engineering and patenting, and we publish at top conferences. Why? This allows us to do things simply not possible before. Here are some examples.

Fast, full versioning.

Versioning of large data sets is an incredibly powerful tool. Not just low-performance snapshots for backups, but high-performance, concurrently accessible clones and snapshots of live datasets for test and development, offering many users different, writeable views of the same large dataset, going back in time, and much more.

Traditionally, the state of the art in algorithms for versioning large data sets is based on a data structure known as the ‘copy-on-write B-tree’ (CoW B-tree) - this is ubiquitous in file systems and databases including ZFS, WAFL, Btrfs, and more. The CoW B-tree (and most of its variants, such as append-only trees, log file systems, redirect-on-write, etc.) has three fundamental problems: (1) it is space-inefficient (and thus requires frequent garbage collection); (2) it relies on random IO to scale (and thus performs poorly on rotational drives); and (3) it cannot perform fast updates, even on SSDs.

Acunu has invented a fundamentally new data structure - the Stratified B-tree - that addresses all the above problems. Some details of this revolutionary data structure have been published: see [Twigg, Byde - Stratified B-trees and versioned dictionaries, USENIX HotStorage’11].

Designed for SSDs.

Existing storage schemes do not address the fact that SSDs require addressing in a fundamentally different way. Although they present a SATA/SAS interface and are sector-addressed, this is only to allow them to be a drop-in replacement for hard drives.
Extracting maximum performance and lifetime from SSDs requires two things: (1) a storage stack that understands how they operate; and (2) new data structures and algorithms that exploit their design characteristics.

By understanding how SSDs fundamentally work, Acunu has been able to engineer data structures that allow unprecedented long-term write performance, while guaranteeing device endurance.

Not just peak performance, but predictable performance.

By eliminating JVM-based garbage collection and memory management issues, and carefully controlling hardware access from within the Linux kernel, Acunu is able to offer predictably high performance, even under sustained high loads, with both ingest and analytic range queries - the perfect ingredients for any real-time analytics platform. Watch carefully in future versions as Acunu begins to deploy fundamentally new offerings here, exploiting our back-end algorithmic advantage.
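The snapshot and clone operations described under ‘Fast, full versioning’ can be pictured with a deliberately naive, in-memory sketch in Python. This is not the Stratified B-tree and not Acunu’s API: the class `VersionedStore` and all of its method names are hypothetical, chosen only to show what a versioned dictionary offers its users. Each version stores only its own writes and falls back to its parent, so reads walk the whole chain of ancestors, a cost that a production data structure would need to avoid.

```python
class VersionedStore:
    """Toy versioned key-value store (illustrative only, hypothetical API).

    Each version records only the keys written in it and points at its
    parent version; reads fall back to ancestors for unwritten keys.
    """

    def __init__(self):
        self._parent = {0: None}   # version id -> parent id (None for the root)
        self._writes = {0: {}}     # version id -> {key: value} written in that version
        self._frozen = set()       # read-only versions (snapshots)
        self._next_id = 1

    def _child_of(self, version):
        vid = self._next_id
        self._next_id += 1
        self._parent[vid] = version
        self._writes[vid] = {}
        return vid

    def put(self, version, key, value):
        if version in self._frozen:
            raise ValueError(f"version {version} is a read-only snapshot")
        self._writes[version][key] = value

    def get(self, version, key, default=None):
        v = version
        while v is not None:               # nearest version that wrote the key wins
            if key in self._writes[v]:
                return self._writes[v][key]
            v = self._parent[v]
        return default

    def range(self, version, lo, hi):
        """Sorted (key, value) pairs with lo <= key < hi, as seen by `version`."""
        merged = {}
        v = version
        while v is not None:
            for key, value in self._writes[v].items():
                if lo <= key < hi and key not in merged:
                    merged[key] = value
            v = self._parent[v]
        return sorted(merged.items())

    def snapshot(self, version):
        """Freeze `version` as a snapshot; return a new writable version on top of it."""
        self._frozen.add(version)
        return self._child_of(version)

    def clone(self, version):
        """Freeze `version`; return two independent writable versions based on it."""
        self._frozen.add(version)
        return self._child_of(version), self._child_of(version)


store = VersionedStore()
store.put(0, "tick:001", 101.5)
live = store.snapshot(0)              # version 0 is now a read-only snapshot
store.put(live, "tick:001", 99.0)     # overwrite in the live version only
store.put(live, "tick:002", 100.2)
live, dev = store.clone(live)         # two independent writable views
store.put(dev, "tick:002", 0.0)       # test data, invisible to `live`

print(store.get(0, "tick:001"))       # value as of the snapshot
print(store.get(live, "tick:002"))    # value in the live view
print(store.range(dev, "tick:000", "tick:999"))
```

The point of the sketch is the interface: a snapshot keeps an old view readable while inserts continue, and a clone gives a second writable view of the same data for test or development work without copying it.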
V1: Supercharging Apache Cassandra with Acunu.

A major feature of the Acunu Storage Core is its predictably high performance, even under sustained heavy load. Often, this is more important than absolute peak performance figures - if you know what to expect from a node, then you can add nodes to get the desired performance level. On the other hand, if performance is unpredictable, how many nodes should you use?

The graphs below show the difference between Acunu’s Distribution for Cassandra and vanilla Apache Cassandra, under a sustained heavy load of 50k inserts/second. It is easy to see the advantage of Acunu - the worst-case latency never exceeds 18ms, whereas for Apache Cassandra it often exceeds 10,000ms.

Next, we consider the performance of range queries under sustained insert load. Immediately after performing the inserts above, we attempted to perform a large sequence of small range queries, simulating a real-time analytics workload.

The graph on the right shows the result. With Acunu, Cassandra was able to sustain over 40 range queries per second (this is an area we have not optimized for V1, and will dramatically improve in a later release). Apache Cassandra, by contrast, manages about 0.3 queries per second. After about 1 hour, this improves slightly since we manually triggered a ‘major compaction’ (in practice, this is not possible during sustained inserts).
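The sizing argument at the start of this section (predictable per-node performance lets you work out how many nodes to buy) amounts to a one-line calculation. The Python sketch below is illustrative only: `nodes_needed` and its `headroom` parameter are our own names, the per-node figure reuses the 50k inserts/second test load purely as an example, and it assumes throughput scales linearly with node count, which the whitepaper does not state.

```python
import math


def nodes_needed(target_ops_per_sec, per_node_ops_per_sec, headroom=0.0):
    """Nodes required to sustain a target rate, assuming linear scaling.

    `headroom` is the fraction of spare capacity to keep in reserve
    (0.25 means provision for 25% more than the target).
    """
    if per_node_ops_per_sec <= 0:
        raise ValueError("per-node throughput must be positive")
    return math.ceil(target_ops_per_sec * (1 + headroom) / per_node_ops_per_sec)


# Hypothetical target of 200k inserts/s with a predictable 50k inserts/s per node.
print(nodes_needed(200_000, 50_000))
print(nodes_needed(200_000, 50_000, headroom=0.25))
```

If per-node throughput swings unpredictably, there is no single value to plug in for `per_node_ops_per_sec`, and the calculation has to be done against the worst case.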
Licensing, Pricing, Support.

At launch, the Acunu Data Platform will come in two flavors:
• Enterprise Edition: The full Acunu stack, with either regular (5x8) or premium (24x7) support via phone, email and web at support.acunu.com. Please contact firstname.lastname@example.org for details.
• Standard Edition: Same as Enterprise Edition, but limited to 2 nodes, with mailing list / community support. Free for production use.

Tested and supported.

Whatever edition and level of support you opt for, we are committed to making sure the product you use is rock-solid, and ready for prime-time production use. Unlike other vendors of open-source software, the free version and enterprise version of the Acunu Storage Core are the same thing, with both builds subjected to the same rigorous testing and QA, involving over 300 machine-hours of tests per build. Even if you use the Standard Edition, we provide detailed support through user and developer mailing lists. For the Enterprise Edition, we offer unparalleled access to our team of support engineers, and world-class engineers and PhDs, via support.acunu.com.

Open-source.

We recognize the importance of the open source community in developing, maintaining, innovating and educating around complex and fundamental software projects. We also recognize that, in order to become strongly adopted, our most fundamental code should be open for anyone to examine and improve. That’s why we’re making the Acunu Storage Core open-source, under the GPLv2. All of our contributions to Apache Cassandra and other open-source projects will be released under the appropriate licenses, too.
The rest of the Acunu Data Platform, including the enterprise-grade management and monitoring tools, and additional performance packs, will be released in due course.

Community.

Acunu is committed to contributing back to the open-source communities for the products we use, and to leveraging their ability to strengthen and develop our own open-source projects (such as the Acunu Storage Core and others coming in the future). We welcome all contributions and developments from the community.
About Acunu.

Acunu is reengineering the storage stack from the ground up for the age of Massive Data. Based on fundamental algorithms research and world-class engineering, the Acunu Platform allows applications such as Apache Cassandra and Hadoop, along with many others, to (1) drive today’s commodity hardware harder than ever before, including many-core architectures, SSDs and large SATA drives; (2) exploit new features in the Acunu Core (such as fast cloning and versioning); and (3) obtain predictable, reliable high performance. Storage is the key to understanding Massive Data, and gaining competitive advantage. The Acunu Open Platform lets companies do this quicker, easier and cheaper.

Acunu was founded in 2009 by researchers and engineers from Cambridge, Oxford, and several well-known high-tech companies. We are backed by some of Europe’s top VCs, with total funding over $5.0M. We are based in London and California.

Founders.

Dr Tim Moreton, CEO: Tim is an expert in distributed file systems. He holds a PhD from Cambridge, where he built a distributed file system for the Xen project. He was previously at Tideway (now BMC), where he was lead engineer on a number of data center projects.

Dr Andy Twigg, CTO: Andy has an outstanding track record of theoretical and applied computing research. He has held positions at Cambridge University, Microsoft Research, Thomson Research and Oxford University. His PhD in 2006 on compact routing algorithms was nominated for the BCS Best Dissertation Award. He holds a Junior Research Fellowship at Oxford University, where he is a member of the CS department.

Tom Wilkie, VP Engineering: Tom was one of the first UK employees at XenSource before its acquisition by Citrix in 2007. He worked on the XenCenter management stack and numerous customer projects. He has a BA in Computer Science from Cambridge.

Dr John Wilkes, Technical Advisor: John is an advisor to Acunu.
John led the Storage Systems group at HP Labs for 15 years, before moving to Google in 2008. John received his PhD from Cambridge in 1984 and an Outstanding Contribution award from SNIA in 2001, and was made an ACM Fellow in 2002.