Apache Accumulo 1.8.0 Overview



This talk is an overview of the new features and improvements currently implemented for the Apache Accumulo 1.8.0 release. It discusses some of these changes, with a focus on what matters most to users.

Published in: Software


  1. Apache Accumulo 1.8.0 Overview. Josh Elser, Apache Accumulo Meetup Group, 2016/06/27
  2. Apache Accumulo 1.8.0 (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
     • First release candidate in the works
     • A “minor” release, but significantly more work required than a “patch” release
       – ContinuousIngest and verification
       – RandomWalk
     • Long time coming
  3. Semantic Versioning
     • Defines a set of rules for software projects to adhere to across different versions, written as major.minor.patch
     • Gives a clear understanding of compatibility
     • Rules are defined in terms of a “public API”
       – Defined by the project adopting SemVer
     • Major: incompatible changes, deprecations removed
     • Minor: backwards-compatible features added
     • Patch: backwards-compatible bug fixes only (no features)
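The major/minor/patch rules above can be sketched as a small compatibility check. This is a minimal illustration, not an Accumulo class: `SemVerSketch` and `guaranteesCompatibilityFor` are hypothetical names, and the sketch assumes the guarantee only runs forward within one major version (code built against an older release runs on a newer one). It covers the public API only; persistent data and RPC compatibility are separate concerns.

```java
/** Sketch of the major.minor.patch rules from the slide; not an Accumulo class. */
final class SemVerSketch implements Comparable<SemVerSketch> {
    final int major;
    final int minor;
    final int patch;

    SemVerSketch(int major, int minor, int patch) {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
    }

    /** Parses a plain "major.minor.patch" string (no qualifiers such as -SNAPSHOT). */
    static SemVerSketch parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected major.minor.patch: " + version);
        }
        return new SemVerSketch(
            Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2]));
    }

    @Override
    public int compareTo(SemVerSketch other) {
        if (major != other.major) return Integer.compare(major, other.major);
        if (minor != other.minor) return Integer.compare(minor, other.minor);
        return Integer.compare(patch, other.patch);
    }

    /**
     * Assumption: the guarantee only runs forward. Code built against
     * builtAgainst is covered when this (the runtime version) has the same
     * major version and is not older than builtAgainst.
     */
    boolean guaranteesCompatibilityFor(SemVerSketch builtAgainst) {
        return major == builtAgainst.major && compareTo(builtAgainst) >= 0;
    }

    public static void main(String[] args) {
        SemVerSketch runtime = parse("1.8.0");
        // An application built against 1.7.1, run against 1.8.0.
        System.out.println(runtime.guaranteesCompatibilityFor(parse("1.7.1")));
        // An application built against 1.8.0, run against 1.7.0.
        System.out.println(parse("1.7.0").guaranteesCompatibilityFor(runtime));
    }
}
```

Under these assumptions, a newer minor release accepts code built against an older one, while the reverse direction carries no guarantee because the newer release may have added API.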
  4. Apache Accumulo and Semantic Versioning
     • Apache Accumulo defines a public API
       – Made up of Java classes, defined by packages
       – The goal is to describe how user code should function across releases
       – Recursively, all public types in the following packages (excluding impl, thrift, or crypto)
         • org.apache.accumulo.core.{client,data,security}
         • org.apache.accumulo.minicluster
     • Other concerns for compatibility too
       – RPC classes
       – Persistent data (RFiles and ZooKeeper)
     • Not comprehensive!
       – Not all user-facing code is yet included in the public API
         • Monitoring UIs and data
         • Start/stop scripts
         • The Accumulo Shell
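The package rule above (the listed roots, recursively, minus any impl, thrift, or crypto sub-package) can be expressed as a simple name check. This is only a sketch of the rule as stated on the slide: `PublicApiSketch` and `isPublicApi` are hypothetical names, the check works on fully qualified class names as plain strings, and it assumes class visibility has already been checked elsewhere.

```java
/** Sketch of the public API package rule from the slide; not an Accumulo class. */
final class PublicApiSketch {
    // Package roots named on the slide.
    private static final String[] ROOTS = {
        "org.apache.accumulo.core.client",
        "org.apache.accumulo.core.data",
        "org.apache.accumulo.core.security",
        "org.apache.accumulo.minicluster"
    };

    // Sub-package names the slide excludes from the public API.
    private static final String[] EXCLUDED = {"impl", "thrift", "crypto"};

    /** Returns true when the fully qualified class name falls under the rule. */
    static boolean isPublicApi(String className) {
        for (String root : ROOTS) {
            if (!className.startsWith(root + ".")) {
                continue;
            }
            String[] segments = className.substring(root.length() + 1).split("\\.");
            // The last segment is the class itself; earlier ones are sub-packages.
            for (int i = 0; i < segments.length - 1; i++) {
                for (String excluded : EXCLUDED) {
                    if (segments[i].equals(excluded)) {
                        return false;
                    }
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isPublicApi("org.apache.accumulo.core.data.Key"));
        System.out.println(isPublicApi("org.apache.accumulo.core.client.impl.Example"));
    }
}
```

A check like this could back a build-time lint that flags imports outside the public API, since only those types fall under the versioning rules.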
  5. Apache Accumulo and Semantic Versioning: POP QUIZ!
     • Is it guaranteed that your application from 1.7.1 will work against 1.8.0?
     • What about a 1.6.5 application?
     • Are you guaranteed to be able to roll back an upgrade from 1.8.0 to 1.7.1?
     • Is it guaranteed that your 1.8.0 application will work against 1.7.0?
  6. Notable changes currently staged for Apache Accumulo 1.8.0
  7. System Administrator Changes
     • [ACCUMULO-925] - Launch scripts should use a PIDfile
       – New script:
       – Encapsulates only the things that need to happen on the machine starting a process
         • No SSH’ing
       – Support for PID files to track processes
       – Rotating .out and .err files on start
         • Critical for delayed JVM layer issues
  8. Performance!
     • [ACCUMULO-3423] - Speed up write-ahead log (WAL) roll-overs
       – Changes how references to WALs are stored by Accumulo
       – Reduces the number of writes when switching to a new WAL
       – Uses ZooKeeper to track the state, copies into tablet row before recovery starts
       – 10-30% faster than the previous implementation (while exacerbating the problem)
     • [ACCUMULO-1124] - Optimize index size in RFile
       – RFiles have “data” and “index” blocks; the index maps a RowID to the data block containing that RowID
       – Large RowIDs bloat the index (e.g. inverted URL)
       – Fewer index blocks can be cached
       – Related work: [ACCUMULO-4164] and [ACCUMULO-4314]
  9. New Features
     • [ACCUMULO-3913] - Add per table sampling
       – Helpful in running analytics over some percentage of the total data
       – Can automatically create samples during compaction or on the fly using Iterators
       – Configurable hashing to ensure consistency across “index” and “data” tables
         • No dangling references from index records or unreachable data records
       – Consider snapshotting a sample of a table. After compaction, it is just a “normal” table
     • [ACCUMULO-4187] - Rate limiting of major compactions
       – Compactions can strain system resources: hardware, JVM and HDFS
       – Normally, it is desirable to process compactions as fast as possible
       – Can negatively affect low-latency workloads
       – Configure a limit in bytes per second that a TabletServer should process during compaction
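The consistency point in the sampling feature can be illustrated with a hash-based sampler: if the keep-or-drop decision depends only on a hash of the shared identifier, an “index” entry and the “data” entry it points to always land in the sample together. This is a sketch of the idea, not Accumulo's implementation. `HashSampleSketch` and its methods are hypothetical names, CRC32 is swapped in purely because it ships with the JDK, and the assumption is that both tables expose the same document ID somewhere in their keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch of consistent hash-based sampling; not Accumulo's sampler. */
final class HashSampleSketch {
    private final int modulus;

    /** Keeps roughly one in every modulus identifiers, assuming well-spread hashes. */
    HashSampleSketch(int modulus) {
        if (modulus <= 0) {
            throw new IllegalArgumentException("modulus must be positive");
        }
        this.modulus = modulus;
    }

    /** The decision depends only on the document ID, so it is repeatable. */
    boolean inSample(String documentId) {
        CRC32 crc = new CRC32();
        crc.update(documentId.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % modulus == 0;
    }

    /** Index table entry: the term is ignored, only the referenced document is hashed. */
    boolean keepIndexEntry(String term, String documentId) {
        return inSample(documentId);
    }

    /** Data table entry: hashed on the same document ID as the index entry. */
    boolean keepDataEntry(String documentId) {
        return inSample(documentId);
    }

    public static void main(String[] args) {
        HashSampleSketch sampler = new HashSampleSketch(10);
        String doc = "doc-42";
        System.out.println(sampler.keepIndexEntry("accumulo", doc) == sampler.keepDataEntry(doc));
    }
}
```

Hashing the term instead of the document ID on the index side would break this property, which is why the part of the key that gets hashed needs to be configurable per table.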
  10. New Features (pt. 2)
      • [ACCUMULO-3948] - Enable A/B testing of scan iterators on a table
        – A classpath context is a definition of JARs which the TabletServer should dynamically load
        – Configuration allows a context to be specified when using a [Batch]Scanner
        – Multiple implementations of the same SKVIterator classes can co-exist
        – Useful in testing new implementations of iterators on real data before switching production
      • [ACCUMULO-626] - Create an iterator fuzz tester
        – Writing SKVIterators is notoriously difficult
        – Many common pitfalls and gotchas, often not appearing until “real” use
        – A testing framework codifies these edge cases and can automatically test iterators
          • Similar to “security fuzzing”
        – Users must provide data sets and the expected outcome from using their SKVIterator
        – A supplement to unit testing and MiniAccumuloCluster, not a replacement
        – Test cases implicitly encourage good design of SKVIterators
  11. New APIs
      • [ACCUMULO-2883] - Add API to fetch current locations of Tablets
        – Long-standing feature request (order of years)
        – Extremely useful for distributed execution engines for locality-aware computation
          • Apache Hive, Presto, Apache Drill, Apache Spark, etc.
        – Smart placement can reduce client <--> Accumulo network traffic
          • Locality with Accumulo Tablets also implies locality with HDFS data (over time)
      • [ACCUMULO-4165] - Create a user level API for RFile
        – Example of a “glaring” hole in the public API
        – Only stable way to create an RFile is via MapReduce
        – Provides a supported API for reading and writing RFiles
        – Simplifies implementation and use of RFile access internally too
  12. Changes to be wary of
      • [ACCUMULO-3409] - Move default ports out of ephemeral range
        – Traditional ephemeral range on Linux: [32768, 61000]
        – Transient connections can prevent processes from starting
        – Monitor HTTP port moves from 50095 to 9995
      • [ACCUMULO-4077] - Upgrade to Apache Thrift 0.9.3
        – Thrift is used by Accumulo for RPCs
        – Serialized messages are compatible (with caveats) across releases, but Java classes are not
        – A massive pain for downstream integrations
        – If you require a different version of Thrift and want to use Accumulo 1.8.0
          • Shade+Relocate your version of Thrift in your application
          • Upgrade to Apache Thrift 0.9.3
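The port change can be checked with a few lines of arithmetic against the range quoted on the slide. `EphemeralPortSketch` is a hypothetical helper, and it hard-codes the traditional [32768, 61000] range as an assumption; the range actually in effect on a given Linux host is configurable and may differ.

```java
/** Sketch: tests a port against the traditional Linux ephemeral range from the slide. */
final class EphemeralPortSketch {
    // Assumption: the traditional default range; a host may be configured differently.
    static final int EPHEMERAL_LOW = 32768;
    static final int EPHEMERAL_HIGH = 61000;

    /** True when the port could be handed out to a transient outbound connection. */
    static boolean inEphemeralRange(int port) {
        return port >= EPHEMERAL_LOW && port <= EPHEMERAL_HIGH;
    }

    public static void main(String[] args) {
        System.out.println("old monitor port 50095 in range: " + inEphemeralRange(50095));
        System.out.println("new monitor port 9995 in range: " + inEphemeralRange(9995));
    }
}
```

A port inside the range can already be held by an unrelated transient connection when a server process tries to bind it, which is the startup failure the change avoids.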
  13. Thank You
      • Email:
      • Twitter: @josh_elser
      • Mailing list: