Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo Interface

Introducing Accumulo Collections:
A Practical Accumulo Interface
By Jonathan Wolff
jwolff@isentropy.com
Founder, Isentropy LLC
https://isentropy.com
Code and Documentation on GitHub
https://github.com/isentropy/accumulo-collections/wiki

Accumulo Needs A Practical API
● Accumulo is great under the hood, but needs a practical
interface for real-world NoSQL applications.
● Could companies use Accumulo in place of MySQL?
● Accumulo needs a layer to:
1) Handle java Object serialization locally and on tablet servers
2) Handle foreign keys/joins.
3) Abstract iterators, so that it's easy to do server-side
computations.
4) Provide a useful library of filters, transformations, aggregates.

What is Accumulo Collections?
● Accumulo Collections is a new, alternative NoSQL framework that
uses Accumulo as a backend. It abstracts powerful Accumulo
functionality in a concise java API.
● Since Accumulo is already a sorted map, java SortedMap is a
natural choice for an interface. It's already familiar to java
developers. Devs who know nothing about Accumulo can use it to
build giant, responsive NoSQL applications.
● But Accumulo Collections is more than a SortedMap
implementation...
● Many features are implemented on the tablet servers by iterators,
and wrapped in java methods. You don't need to understand
Accumulo iterators to use them.

AccumuloSortedMap wraps an Accumulo table
● AccumuloSortedMap is a java SortedMap implementation that is backed by
an Accumulo table. It handles object serialization and foreign keys, and
abstracts powerful iterator functionality.
● Method calls derive new maps that contain transformations and aggregates.
Derived maps modify the underlying Scanner. This abstracts the concept of
iterators. Derived map methods run on-the-fly and can be chained:
// similar to SQL: WHERE timestamp BETWEEN t0 AND t1 AND rand() > .5
AccumuloSortedMap derivedMap = map.timeFilter(t0,t1).sample(0.5);
// statistical aggregate (mean, sd, n, etc) of values from key range [100,200)
StatisticalSummary stats = map.subMap(100, 200).valueStats();
Each of the above methods stacks an iterator on the underlying map. The
iterators make use of SerDes to operate directly on java Objects.

Just like a standard java SortedMap, but…
● AccumuloSortedMap returns a copy of the map value.
You must put() to save modifications.
● To use sorted map features, the SerDe must serialize
keys so that byte sort order matches java Object sort order.
The default FixedPointSerde is suitable for most
common key types (strings, primitives, byte[], etc).
More about SerDes later…
● Supports sizes greater than Integer.MAX_VALUE. See
sizeAsLong().
● Can be set to read-only. Derived map methods, which
stack scan iterators, always return read-only maps.
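The copy-on-get behavior above follows from storing serialized bytes rather than live objects. A minimal sketch of that idea, using plain java serialization and a local TreeMap in place of an Accumulo table. CopyOnGetMap is a hypothetical class for illustration only and is not part of Accumulo Collections:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.TreeMap;

// Hypothetical stand-in for a table-backed map: values live as bytes.
final class CopyOnGetMap {
    private final TreeMap<String, byte[]> store = new TreeMap<>();

    // Serializes the value and stores the resulting bytes.
    void put(String key, Serializable value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush();
            store.put(key, bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes a fresh object on every call, so callers get a copy.
    Object get(String key) {
        byte[] raw = store.get(key);
        if (raw == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Mutating the object returned by get() leaves the stored bytes untouched. The change becomes visible to later reads only after another put().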

Use Accumulo as a SortedMap
AccumuloSortedMapFactory factory = new AccumuloSortedMapFactory(conn,"factory_name");
AccumuloSortedMap<Long,String> map = factory.makeMap("mapname");
for(long i=0; i<1000; i++){
map.put(i, "value"+i);
}
map.get(123); // equals "value123"
map.keySet().iterator().next(); // equals 0
AccumuloSortedMap submap = map.subMap(100, 150);
submap.size(); // equals 50
submap.firstKey(); // equals 100
submap.keyStats().getSum(); // equals 6225.0
for(Entry<Long,String> e : submap.entrySet()){
// iterate
}
// these commands throw Exceptions. Both Maps are read-only.
map.setReadOnly(true).put(1000,"nogood");
submap.put(1000,"nogood");

Timestamp Features
AccumuloSortedMap makes use of Accumulo's timestamp features
and AgeOffFilter. Each map entry has an insert timestamp:
long insertTimestamp = map.getTimestamp(key);
Can filter map by timestamp. Implemented on tablet servers.
AccumuloSortedMap timeFiltered = map.timeFilter(fromTs, toTs);
Can set an entry TTL in ms. Implemented on tablet servers. Timed
out entries are wiped during compaction:
map.setTimeOutMs(5000);
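To make the two timestamp operations concrete, here is a local sketch of the filtering logic that an iterator could apply per entry. TimestampSketch and Stamped are hypothetical names, and the bound conventions (from inclusive, to exclusive, entries kept while age <= TTL) are assumptions, not documented behavior of the library:

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical local model of timestamped entries; not library code.
final class TimestampSketch {
    static final class Stamped {
        final String value;
        final long timestampMs;

        Stamped(String value, long timestampMs) {
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    // Keeps entries with fromTs <= timestamp < toTs (assumed bounds).
    static SortedMap<String, Stamped> timeFilter(
            SortedMap<String, Stamped> source, long fromTs, long toTs) {
        SortedMap<String, Stamped> out = new TreeMap<>();
        for (Map.Entry<String, Stamped> e : source.entrySet()) {
            long ts = e.getValue().timestampMs;
            if (ts >= fromTs && ts < toTs) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    // Keeps entries whose age at nowMs does not exceed ttlMs.
    static SortedMap<String, Stamped> ageOff(
            SortedMap<String, Stamped> source, long nowMs, long ttlMs) {
        SortedMap<String, Stamped> out = new TreeMap<>();
        for (Map.Entry<String, Stamped> e : source.entrySet()) {
            if (nowMs - e.getValue().timestampMs <= ttlMs) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```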

Filter Entries by Regex
A bundled iterator filters entries on tablet servers by
comparing key.toString() and value.toString() to regexes. To
keep only the entries whose keys match "a(b|c)":
map.put("ac","1");
map.put("ax","2");
map.put("ab","3");
// has only 1st and 3rd entries:
AccumuloSortedMap filtered = map.regexKeyFilter("a(b|c)");
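The same filter can be reproduced locally with java.util.regex to check what a pattern will keep before running it against a table. RegexKeyFilterSketch is a hypothetical helper, and the use of a full match (matches() rather than find()) is an assumption about the bundled iterator:

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical local equivalent of a regex key filter; not library code.
final class RegexKeyFilterSketch {
    private RegexKeyFilterSketch() {}

    // Returns a new sorted map holding only entries whose key fully matches regex.
    static <V> SortedMap<String, V> filterKeys(SortedMap<String, V> source, String regex) {
        Pattern pattern = Pattern.compile(regex);
        SortedMap<String, V> out = new TreeMap<>();
        source.forEach((k, v) -> {
            if (pattern.matcher(k).matches()) {
                out.put(k, v);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        SortedMap<String, String> map = new TreeMap<>();
        map.put("ac", "1");
        map.put("ax", "2");
        map.put("ab", "3");
        System.out.println(filterKeys(map, "a(b|c)")); // prints {ab=3, ac=1}
    }
}
```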

Sampling and Partitioning Features
● AccumuloSortedMap supports sampling and partitioning on the tablet
servers using the supplied SamplingFilter (Accumulo iterator).
● You can derive a map that is a random sample:
AccumuloSortedMap sampleSubmap = map.sample(0.5);
● Or you can define a Sampler which will “freeze” a fixed subsample:
Sampler s = new Sampler("my_sample_seed",0.0,0.1,fromTs, toTs);
AccumuloSortedMap frozenSample = map.sample(s);
● When you supply a sample_seed, you define an ordering of the
keys by hash(sample_seed + key bytes). The same hash range
within that ordering will produce the same sample. The fractions
indicate the hash range.
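The seeded sampling idea can be sketched as follows: hash the seed together with the key bytes, map the hash to a position in [0, 1), and keep the key if that position falls inside the requested range. HashRangeSampler is a hypothetical class; the choice of SHA-256 and the omission of the timestamp bounds are assumptions made for illustration, not details of the supplied SamplingFilter:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of hash-range sampling; not library code.
final class HashRangeSampler {
    private final byte[] seed;
    private final double from;
    private final double to;

    HashRangeSampler(String seed, double from, double to) {
        this.seed = seed.getBytes(StandardCharsets.UTF_8);
        this.from = from;
        this.to = to;
    }

    // Maps hash(seed + key bytes) to a position in [0, 1).
    double position(byte[] keyBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(seed);
            md.update(keyBytes);
            byte[] digest = md.digest();
            long bits = 0L;
            for (int i = 0; i < 8; i++) {
                bits = (bits << 8) | (digest[i] & 0xFF);
            }
            // Keep the top 53 bits so the value is an exact double below 1.0.
            return (bits >>> 11) * 0x1.0p-53;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // A key is in the sample when its position lies in [from, to).
    boolean accept(byte[] keyBytes) {
        double p = position(keyBytes);
        return p >= from && p < to;
    }
}
```

Because the position depends only on the seed and the key, the same seed and range always select the same keys, and disjoint ranges such as [0.0, 0.1) and [0.1, 1.0) give disjoint partitions.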

Map Aggregates Computed on Tablet Servers
● Aggregate functions are implemented using iterators
that calculate aggregate quantities over the entire
tablet server. The results are then combined locally.
● Similar to MapReduce with # mappers = # tservers
and # reducers = 1.
● Examples of built-in aggregate methods : size(),
checksum(), keyStats(), valueStats()
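The two-stage pattern described above (one partial result per tablet server, combined locally) can be sketched with the JDK's LongSummaryStatistics, whose combine() method merges partial summaries. PartialAggregates is a hypothetical class, and the lists stand in for the values held by each tablet server:

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;

// Hypothetical sketch of partial aggregation; not library code.
final class PartialAggregates {
    private PartialAggregates() {}

    // "Map" step: each simulated tablet server summarizes its own values.
    static LongSummaryStatistics summarize(List<Long> values) {
        LongSummaryStatistics stats = new LongSummaryStatistics();
        for (long v : values) {
            stats.accept(v);
        }
        return stats;
    }

    // "Reduce" step: the client merges the partial summaries.
    static LongSummaryStatistics combine(List<LongSummaryStatistics> partials) {
        LongSummaryStatistics total = new LongSummaryStatistics();
        for (LongSummaryStatistics partial : partials) {
            total.combine(partial);
        }
        return total;
    }

    public static void main(String[] args) {
        LongSummaryStatistics total = combine(Arrays.asList(
                summarize(Arrays.asList(1L, 2L, 3L)),
                summarize(Arrays.asList(4L, 5L)),
                summarize(Arrays.asList(6L))));
        System.out.println(total.getCount() + " values, sum " + total.getSum());
    }
}
```

Only the small summary objects cross the network in this model, which is why the approach scales with the number of tablet servers rather than the number of entries.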

Efficient One-to-Many Mapping
● AccumuloSortedMap can be configured to allow multiple
values per key.
● Works by changing the VersioningIterator settings.
● SortedMap functions still work and see only the latest value.
● Extra methods give iterators over multiple values:
– Iterator<V> getAll(Object key)
– Iterator<Entry<K,V>> multiEntryIterator()
● All values for a given key will be stored on the same tablet
server. This enables server-side per-row aggregates. Like
SQL GROUP BY.

One-to-Many Example
map.setMaxValuesPerKey(-1); // unlimited
map.put(1, 2);
map.put(1, 3);
map.put(1, 4);
map.put(2, 22);
AccumuloSortedMap<Number, StatisticalSummary> row_stats = map.rowStats();
StatisticalSummary row1 = row_stats.get(1);
row1.getMean(); // = 3.0
row1.getMax(); // = 4.0
// count multiple values
map.sizeAsLong(true); // = 4
// sum all values, looking at 1 value per key: 4+22
map.valueStats().getSum(); // = 26.0
// sum all values, looking at multiple values per key: 2+3+4+22
map.valueStats(true).getSum(); // = 31.0
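The numbers in this example can be checked with a small in-memory model that keeps every value per key and treats the most recently written one as the "latest". MultiValueSketch is a hypothetical class for illustration; it does not use Accumulo or the VersioningIterator:

```java
import java.util.ArrayList;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of a one-to-many map; not library code.
final class MultiValueSketch {
    private final TreeMap<Integer, List<Integer>> rows = new TreeMap<>();

    // Appends a value; earlier values for the same key are retained.
    void put(int key, int value) {
        rows.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // SortedMap-style read: only the most recently written value is visible.
    Integer get(int key) {
        List<Integer> values = rows.get(key);
        return values == null ? null : values.get(values.size() - 1);
    }

    // Per-key aggregate over all values, similar in spirit to SQL GROUP BY.
    IntSummaryStatistics rowStats(int key) {
        IntSummaryStatistics stats = new IntSummaryStatistics();
        for (int v : rows.getOrDefault(key, new ArrayList<>())) {
            stats.accept(v);
        }
        return stats;
    }

    // Sums either the latest value per key or every stored value.
    long sum(boolean allValues) {
        long total = 0L;
        for (List<Integer> values : rows.values()) {
            if (allValues) {
                for (int v : values) {
                    total += v;
                }
            } else {
                total += values.get(values.size() - 1);
            }
        }
        return total;
    }

    // Counts either keys or every stored value.
    long size(boolean allValues) {
        long n = 0L;
        for (List<Integer> values : rows.values()) {
            n += allValues ? values.size() : 1;
        }
        return n;
    }
}
```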

Writing Custom Transformations and Aggregates
● Accumulo Collections provides useful abstract iterators
that operate on deserialized java Objects.
– Iterators are passed the SerDe classnames so that they
can read the deserialized Objects.
● You can extend these iterators to implement your own
transformations and aggregates. The API is very simple:
abstract Object transformValue(Object k, Object v);
abstract boolean allow(Object k, Object v);
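To show how the two abstract methods fit together, here is a local sketch that applies them to an ordinary SortedMap. LocalEntryTransformer and TripleEvenKeys are hypothetical classes; the real base classes are Accumulo iterators that run on tablet servers, and only the two method signatures are taken from the slide:

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical local stand-in for the abstract iterators; not library code.
abstract class LocalEntryTransformer {
    abstract Object transformValue(Object k, Object v);

    abstract boolean allow(Object k, Object v);

    // Filters with allow(), then rewrites values with transformValue().
    final <K> SortedMap<K, Object> apply(SortedMap<K, ?> source) {
        SortedMap<K, Object> out = new TreeMap<>(source.comparator());
        for (Map.Entry<K, ?> e : source.entrySet()) {
            if (allow(e.getKey(), e.getValue())) {
                out.put(e.getKey(), transformValue(e.getKey(), e.getValue()));
            }
        }
        return out;
    }
}

// Example subclass: keep even numeric keys and triple their values.
final class TripleEvenKeys extends LocalEntryTransformer {
    @Override
    Object transformValue(Object k, Object v) {
        return ((Number) v).longValue() * 3;
    }

    @Override
    boolean allow(Object k, Object v) {
        return ((Number) k).longValue() % 2 == 0;
    }
}
```

Both methods receive deserialized Objects, so the subclass contains only application logic and no byte handling.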

Example: Custom JavaScript Transformation
As an example of custom transformations, consider
ScriptTransformingIterator in the "experimental" package. You can pass
JavaScript code, which is interpreted on the tablet servers. The key and
value bind to JavaScript variables "k" and "v". For example:
Allow only entries with even keys:
AccumuloSortedMap evens = map.jsFilter("k % 2 == 0");
Map of key → 3*value:
AccumuloSortedMap tripled = map.jsTransform(" 3*v ");
These examples work on keys and values that are java Numbers. Other
JavaScript functions also work on Strings, java Maps, etc.

Foreign Keys
Accumulo Collections provides a serializable ForeignKey Object which is
like a symbolic link that points to a map plus a key. There is no integrity
checking of the link:
map1.put("key1", "value1");
ForeignKey fk_to_key1 = map1.makeForeignKey("key1");
map2.put("key2", fk_to_key1);
// both equal "value1"
fk_to_key1.resolve(conn);
map2.get("key2").resolve(conn);
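The symbolic-link behavior can be modeled as a serializable (map name, key) pair that is looked up at resolve time. LinkSketch and Link are hypothetical names, and a plain registry of in-memory maps stands in for the Accumulo connection:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a foreign key as a symbolic link; not library code.
final class LinkSketch {
    private LinkSketch() {}

    static final class Link implements Serializable {
        private static final long serialVersionUID = 1L;
        final String mapName;
        final String key;

        Link(String mapName, String key) {
            this.mapName = mapName;
            this.key = key;
        }

        // Looks up the target at call time; a missing target yields null.
        Object resolve(Map<String, Map<String, Object>> tables) {
            Map<String, Object> target = tables.get(mapName);
            return target == null ? null : target.get(key);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> tables = new HashMap<>();
        tables.put("map1", new TreeMap<>());
        tables.put("map2", new TreeMap<>());

        tables.get("map1").put("key1", "value1");
        Link fkToKey1 = new Link("map1", "key1");
        tables.get("map2").put("key2", fkToKey1);

        System.out.println(fkToKey1.resolve(tables)); // prints value1
        tables.get("map1").remove("key1");
        System.out.println(fkToKey1.resolve(tables)); // prints null
    }
}
```

The second lookup illustrates the lack of integrity checking: deleting the target leaves a dangling link that resolves to nothing.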

Using AccumuloSortedMapFactory
● The map factory is the preferred way to construct
AccumuloSortedMaps. The factory is itself a map
of (map name→ map metadata) with default
settings. The factory:
– acts as a namespace, mapping map names to real
Accumulo table names.
– Configures SerDes.
– Configures other metadata like
max_values_per_key.

Factory Example
AccumuloSortedMapFactory factory;
AccumuloSortedMap map;
factory = new AccumuloSortedMapFactory(conn,"factory_table");
// 10 values per key default for all maps
factory.addDefaultProperty(MAP_PROPERTY_VALUES_PER_KEY, "10");
// 5000ms timeout in map "mymap"
factory.addMapSpecificProperty("mymap", MAP_PROPERTY_TTL, "5000");
map = factory.makeMap("mymap");

More about SerDes
● Accumulo uses BytesWritable.compareTo() to
compare keys on the tablet servers.
– No way to set alternate comparator (?)
● Keys must be serialized in such a way that byte
sort order is same as java sort order.
● FixedPointSerde, the default SerDe, writes
Numbers in fixed point unsigned format so that
numerical comparison works. Other Objects are
java serialized.
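The requirement that byte order match numeric order can be illustrated with a common encoding for signed longs: flip the sign bit and write the result big-endian. SortableLongCodec is a hypothetical helper, and this encoding is an assumption used for illustration; it is not claimed to be the exact format that FixedPointSerde writes:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical order-preserving codec for longs; not library code.
final class SortableLongCodec {
    private SortableLongCodec() {}

    // Flipping the sign bit makes unsigned byte order equal signed numeric order.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
    }

    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] samples = {Long.MIN_VALUE, -5L, -1L, 0L, 1L, 255L, 256L, Long.MAX_VALUE};
        for (int i = 0; i + 1 < samples.length; i++) {
            byte[] a = encode(samples[i]);
            byte[] b = encode(samples[i + 1]);
            // Arrays.compareUnsigned treats bytes as unsigned (Java 9+).
            System.out.println(samples[i] + " sorts before " + samples[i + 1] + ": "
                    + (Arrays.compareUnsigned(a, b) < 0));
        }
    }
}
```

A plain two's-complement encoding would place negative numbers after positive ones under unsigned byte comparison, which is why the sign bit is flipped.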

Bulk Import, Saving Derived Maps
● The putAll and importAll methods in AccumuloSortedMap batch
writes to Accumulo, unlike put(). You can save a derived map using
putAll:
map.putAll(someOtherMap);
● importAll() is like putAll(), but takes an Iterator as an argument. This
can be used to import entries from other sources, like input streams
and files.
map.importAll(new TsvInputStreamIterator("importfile.tsv"));
● Aside from batching, putAll() and importAll() do not do anything
special on the tablet servers. The import data all passes through the
local machine to Accumulo. The optional KeyValueTransformer runs
locally.

Benchmarks
● I benchmarked Accumulo Collections against raw
Accumulo reads/writes on a toy Accumulo cluster
running in Docker. All the moving parts of a real
cluster, but running on one machine.
● All tests so far indicate that Accumulo Collections
adds very little overhead (~10%) to normal
Accumulo operation.
● I would appreciate it if someone sends me
benchmarks from a proper cluster!

Benchmark Data
[Figure: bar chart "Raw Accumulo vs Accumulo Collections", median time in ms for 10000 operations. Categories: read, write batched, write unbatched. Series: raw, Acc Collections. Axis: median time (ms), 0 to 18.]

Performance Tips
● Batched writes are much faster. Use putAll() and
importAll() in place of put() when possible.
– Write your changes locally to a memory-based
Map, then store in bulk with putAll().
● Iterating over a range is much faster than lots of
individual get() calls.
– If you need to do lots of get() calls over a small
submap, you can cache a map locally in memory
with the localCopy() method.

Contact Info
● I'm available for hire. You can email me at
jwolff@isentropy.com. My consulting company,
Isentropy, is online at https://isentropy.com .
● Accumulo Collections is available on GitHub at
https://github.com/isentropy/accumulo-collections
● Constructive questions and comments welcome.
