Lessons Learned: Optimizing Accumulo as a Backend for User Applications

Problems With the Web Architecture
Volume
Response Time

Why Accumulo
Ability to Scale
Visibility Built In
Flexible Storage Model

Accumulo’s Shortcomings
Indices are not built in
No SLA for response time
Storage technique can greatly affect scan response time
Large datasets can take minutes to fully return

Solutions to Specific Issues
Cache query results
Increase speed of scans
Data sampling
Returning partial results/pagination

Caching Results
Problem
◦ No SLA for response time
◦ Many users issue the same queries
Benefit
◦ Results of common queries can be stored
◦ Reduce the number of times Accumulo is scanned
Requirement
◦ Mirror Accumulo’s sort order
◦ Implement visibility
Solution
◦ Redis

Caching Results: Redis
Open source
In-memory data structure store
Supports multiple data structures
Built in replication

Caching Results: Sorted Set
Redis offers several data structures, but we want one that easily replicates Accumulo’s behavior
The best choice is a sorted set, though not for the most obvious reason
Default behavior is to sort by each element’s score
◦ Secondary behavior: elements with the same score are sorted lexicographically
◦ Giving every element the same score lets us exploit this to mimic Accumulo’s sort order in Redis
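The ordering rule can be modeled in a few lines of Python. This is only a sketch of the sorted set behavior described above, not Redis itself; `sorted_set_order` is a hypothetical helper name, and the scores are chosen for illustration.

```python
def sorted_set_order(members):
    """Return the elements in the order a sorted set would hold them.

    `members` maps each element to its score. Elements are ordered by
    score first; ties are broken by comparing the element bytes.
    """
    return sorted(members, key=lambda m: (members[m], m.encode("utf-8")))


# Every element shares the score 1, so only the byte order matters.
same_score = {"banana": 1, "apple": 1, "beet": 1, "avocado": 1}
print(sorted_set_order(same_score))

# With different scores the score wins and lexicographic order is lost.
mixed_score = {"banana": 1, "apple": 2}
print(sorted_set_order(mixed_score))
```

This is why the cache only mirrors Accumulo's key order if every element is written with the same score.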

Caching Results: Adding Elements
redis> ZADD accumulo 1 apple
(integer) 1
redis> ZADD accumulo 1 banana
(integer) 1
redis> ZADD accumulo 1 beet
(integer) 1
redis> ZADD accumulo 1 avocado
(integer) 1

Caching Results: Retrieving Elements
redis> ZRANGEBYLEX accumulo - +
1) "apple"
2) "avocado"
3) "banana"
4) "beet"
redis> ZRANGEBYLEX accumulo [avocado (beet
1) "avocado"
2) "banana"

Caching Results: Reverse Order
redis> ZREVRANGEBYLEX accumulo + -
1) "beet"
2) "banana"
3) "avocado"
4) "apple"
redis> ZREVRANGEBYLEX accumulo (beet [avocado
1) "banana"
2) "avocado"

Caching Results: Enforcing Visibility
redis> ZADD accumulo:visibility 1 pear
(integer) 1
redis> ZADD accumulo:visibility 1 plum
(integer) 1
redis> KEYS *
1) "accumulo:visibility"
2) "accumulo"
Encode it in the key!
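One way to build such keys is sketched below. The helper name `cache_key` and the idea of joining a user's authorization labels into the key are assumptions for illustration; the slides only show a single `accumulo:visibility` key. The sketch also assumes that labels never contain the `:` separator.

```python
def cache_key(query_id, authorizations):
    """Build a cache key that is unique per query and per authorization set."""
    # Sort and de-duplicate so the same set of labels always yields the same key.
    labels = sorted(set(authorizations))
    return ":".join([query_id] + labels)


print(cache_key("accumulo", []))              # accumulo
print(cache_key("accumulo", ["visibility"]))  # accumulo:visibility
print(cache_key("accumulo", ["b", "a", "b"])) # accumulo:a:b
```

Users with different authorizations then read from different sorted sets, so a cached result is never served to someone who could not have scanned it.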

Increase Speed of Scans
Problem
◦ Trying to return too much data
◦ Storage is not optimal
Benefit
◦ Lowering response time for Accumulo access
Requirement
◦ Minimal effort to decrease scan time
Solutions
◦ Combine and compress results
◦ Store results more efficiently

Increase Speed of Scan: Combine and Compress
RowID    CF:CQ              Vis  Value
apple    fruit:produce      []   10
avocado  fruit:produce      []   4
banana   fruit:produce      []   11
beet     vegetable:produce  []   3
Results will be returned in order: 10, 4, 11, 3
Using an iterator to combine the results and compress them will result in less
network traffic
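A real Accumulo iterator runs server-side and is written in Java; the Python below only models the transformation such an iterator might apply. It is a sketch under assumptions: the function names are hypothetical, JSON plus zlib is one possible encoding, and row IDs are assumed to be unique.

```python
import json
import zlib


def combine_and_compress(entries):
    """Fold (row_id, value) pairs into one compressed payload."""
    combined = {row_id: value for row_id, value in entries}
    payload = json.dumps(combined, separators=(",", ":")).encode("utf-8")
    return zlib.compress(payload)


def decompress(blob):
    """Reverse combine_and_compress on the client side."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


entries = [("apple", 10), ("avocado", 4), ("banana", 11), ("beet", 3)]
blob = combine_and_compress(entries)
print(decompress(blob))
```

The client receives one value instead of four. Compression only pays off on larger payloads; for a handful of entries like this the compressed form can be bigger than the input.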

Increase Speed of Scans: Store More Efficiently
RowID      CF:CQ               Vis  Value
fruit      department:produce  []   {"apple":10, "avocado":4, "banana":11}
vegetable  department:produce  []   {"beet":3}
Similar to previous solution except no need for an iterator
Greatly reduces the number of next() and seek() calls
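The regrouping could be done at ingest time along these lines. This is a sketch, not the presenter's code: `group_rows` is a hypothetical helper, and it assumes the original column family (fruit, vegetable) becomes the new row ID, as in the table above.

```python
import json
from collections import defaultdict


def group_rows(entries):
    """Turn (row_id, family, value) entries into one JSON value per family."""
    grouped = defaultdict(dict)
    for row_id, family, value in entries:
        grouped[family][row_id] = value
    # One serialized value per new row, keyed by the new row ID.
    return {family: json.dumps(items) for family, items in grouped.items()}


per_item = [
    ("apple", "fruit", 10),
    ("avocado", "fruit", 4),
    ("banana", "fruit", 11),
    ("beet", "vegetable", 3),
]
for new_row_id, value in group_rows(per_item).items():
    print(new_row_id, "department:produce", value)
```

Four key/value pairs become two, so a scan over the produce department touches fewer entries.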

Data Sampling
Problem
◦ Data being retrieved is too large
◦ Users may only want to see a representative sample
Benefit
◦ Quickly return results
Requirement
◦ Know how much data to return
Solution
◦ Reservoir sampling

Data Sampling: Reservoir Sampling
Randomly chooses a sample of k items from a list S containing n items, where n is either a very large or unknown number
A sample algorithm
◦ Keep the first k items in memory
◦ For the i-th item, where i > k:
  ◦ with probability k/i, keep the new item (discard an old one, selecting which to replace at random, each with chance 1/k)
  ◦ with probability 1−k/i, keep the old items (ignore the new one)
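The steps above can be sketched in Python as follows. The function name and the optional `rng` parameter are illustrative choices; the stream can be any iterable, so its length does not need to be known in advance.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Keep the first k items in memory.
            reservoir.append(item)
        else:
            # j is uniform over 0..i-1, so j < k happens with probability k/i,
            # and the slot that gets replaced is uniform over the k slots.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
print(sample)
```

Drawing a single index `j` covers both random choices in the description: whether to keep the new item, and which old item to discard.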

Partial Results/Pagination
Problem
◦ Data being retrieved is too large
Benefit
◦ Quickly return results
◦ Cache results
◦ Show fraction of entire result
Requirement
◦ Know how much data to return
Solution
◦ Multithreaded client caching results in Redis

Partial Results: Multithreaded Client
Immediately return the first n results
In the background, load the remaining results into the cache
As users request more data, serve it from the cache
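A minimal threading sketch of this flow is shown below. It rests on assumptions: a plain Python list stands in for the Redis cache, `results` stands in for an Accumulo scanner, and `fetch_with_background_cache` is a hypothetical name.

```python
import threading
from itertools import islice


def fetch_with_background_cache(results, n, cache):
    """Return the first n results now and cache the rest in the background."""
    iterator = iter(results)
    first_page = list(islice(iterator, n))
    cache.extend(first_page)

    def load_rest():
        # Drain the remaining results into the cache, preserving order.
        for item in iterator:
            cache.append(item)

    worker = threading.Thread(target=load_rest, daemon=True)
    worker.start()
    return first_page, worker


cache = []
scan = iter(["apple", "avocado", "banana", "beet"])
first_page, worker = fetch_with_background_cache(scan, 2, cache)
print(first_page)  # shown to the user right away
worker.join()      # a real client would not block here
print(cache)       # the full result set, ready for later pages
```

A production client would also need to signal when the background load has finished or failed, which this sketch leaves out.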

Pagination: Redis Cache
redis> ZRANGE accumulo 0 1
1) "apple"
2) "avocado"
redis> ZRANGE accumulo 2 4
1) "banana"
2) "beet"
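Page numbers can be mapped to index ranges as sketched below. The helper names are hypothetical, and the stop index is inclusive to match the ZRANGE calls above. The second call in the session uses a stop of 4, which runs past the end of the set; Redis clamps an out-of-range stop to the last element, so the output is the same as for a stop of 3.

```python
def page_bounds(page, page_size):
    """Return the inclusive (start, stop) indexes for a zero-based page."""
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size >= 1")
    start = page * page_size
    stop = start + page_size - 1
    return start, stop


def get_page(sorted_members, page, page_size):
    """Model a page lookup against a sorted list standing in for the cache."""
    start, stop = page_bounds(page, page_size)
    return sorted_members[start:stop + 1]


cached = ["apple", "avocado", "banana", "beet"]
print(page_bounds(0, 2), get_page(cached, 0, 2))
print(page_bounds(1, 2), get_page(cached, 1, 2))
```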

Known Gaps
Cache
◦ Redis is single threaded
◦ Invalidating the cache when the underlying data changes
◦ Support for full visibility expressions
Pagination
◦ Start from middle of set
◦ Know all the pages within a dataset before the entire set is read

Questions
Contact Info
Email: zachary.radtka@gmail.com
Twitter: @zachradtka
