© Josh Elser 2015, Hortonworks
Alternatives to Apache Accumulo’s Java API
Josh Elser
@josh_elser (@hortonworks)
Or…
I’m really tired of having to write Java
code all the time and I want to use
something else.
Or…
OK, I’ll still write Java, but I’m 110%
done with re-writing the same
boilerplate to parse CLI args, convert
records into a standard format, deal
with concurrency and retry server-side errors...
You have options
There is life after Accumulo’s Java API:
● Apache Pig
● Apache Hive
● Accumulo’s “Thrift Proxy”
● Other JVM-based languages
● Cascading/Scalding
● Spark
Lots of integration points, lots of considerations.
We avoid numbering the considerations because each
differs in importance depending on the application.
Every decision has an effect
Maturity
Stability
Performance
Extensibility
Ease of use
Maturity
How well-adopted is the code you’re using?
Where does the code live? Is there a structured
community or is it just sitting in a Github repository?
Can anyone add fixes and improvements? Are they
merged/accepted (when someone provides them)?
Are there tests and are they actually run?
Are releases made and published regularly?
Your own code is difficult enough to maintain.
Stability
Is there a well-defined user-facing API to use?
Cross-project integrations are notorious for making
assumptions about how you should use the code.
Does the integration produce the same outcomes that the
“native” components do?
Can users reliably expect code to work across versions?
Using some external integration should feel like using the
project without that integration. Code that worked once
should continue to work.
Performance
Does the code run sufficiently fast?
Can you saturate your physical resources with ease?
Do you have to spend days in a performance tool
reworking how you use the API?
Does the framework spend excessive amounts of time
converting types to its own internal representation?
Can you get an answer in an acceptable amount of time?
Each use case has its own set of performance
requirements. Experimentation is necessary.
Ease of Use
Can you write the necessary code in a reasonable
amount of time?
Goes back to: “am I sick of writing verbose code (Java)”?
Choosing the right tool can drastically reduce the amount
of code to write.
Can the solution to your problem be reasonably expressed
in the required language?
Using a library should feel natural and enjoyable to write
while producing a succinct solution.
Extensibility
Does the integration support enough features of the
underlying system?
Can you use the novel features of the underlying system
via the integration?
Can custom parsing/processing logic be included?
How much external configuration/setup is needed before
you can invoke your code?
Using an integration should not require sacrifice in the
features of the underlying software.
Apply it to Accumulo!
Let’s take these 5 points and see how they apply to some of
the more well-defined integration projects.
We’ll use Accumulo’s Java API as the reference point for
how we judge other projects.
Accumulo Java API
Reference implementation on how clients use Accumulo.
Composed of Java methods and classes, each rigorously
assessed for its value and effectiveness.
M: Evaluated/implemented by all Accumulo developers. Well-tested and
heavily critiqued.
S: Follows SemVer since 1.6.1. Is the definition of the Accumulo API.
P: High-performance, typically limited by network and server-side impl.
EoU: Verbose and often pedantic. Implements low-level, key-value-centric
operations, not high-level application functions.
E: Provides well-defined building blocks for implementing custom libraries
and exposes methods for interacting with all Accumulo features.
Apache Pig
Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data
analysis programs.[1]
Default execution runs on YARN (MapReduce and Tez).
Pig is often adored for its fast prototyping and data analysis abilities
with “Pig Latin”: functions which perform operations on Tuples.
Pig Latin allows for very concise solutions to problems.
Pig's LoadFunc/StoreFunc extension points enable AccumuloStorage
1. http://pig.apache.org
Apache Pig
-- Load a text file of data
A = LOAD 'student.txt' AS ( name:chararray, term:chararray, gpa:float);
-- Group records by student
B = GROUP A BY name;
-- Average GPAs per student
C = FOREACH B GENERATE group, AVG(A.gpa);
3 lines of Pig Latin that would take hundreds of lines of Java just to read
the data.
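The same aggregation can be sketched in plain Python (an illustrative stand-in for what the Pig job computes, using made-up records):

```python
from collections import defaultdict

# Made-up (name, term, gpa) records, mirroring student.txt
records = [
    ("alice", "fall", 3.0),
    ("alice", "spring", 4.0),
    ("bob", "fall", 2.0),
]

# B = GROUP A BY name;
grouped = defaultdict(list)
for name, term, gpa in records:
    grouped[name].append(gpa)

# C = FOREACH B GENERATE group, AVG(A.gpa);
averages = {name: sum(gpas) / len(gpas) for name, gpas in grouped.items()}
print(averages)  # {'alice': 3.5, 'bob': 2.0}
```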
AccumuloStorage introduced in Apache Pig 0.13.0
Maps each tuple into an Accumulo row.
Very easy to both write/read data to/from Accumulo.
STORE flights INTO 'accumulo://flights?instance=...' USING
org.apache.pig.backend.hadoop.accumulo.AccumuloStorage(
'carrier_name,src_airport,dest_airport,tail_number');
Apache Pig
Pig enables users to perform lots of powerful data
manipulation and computation tasks with little code, but
requires users to learn Pig Latin, a language unique to Pig.
M: Apache Pig is a very well-defined community with its own processes.
S: Use of Pig Latin with AccumuloStorage feels natural and doesn’t have
edge cases which are unsupported.
P: Often suffers from the under-optimization that comes with generalized
MapReduce. Will incur penalties for quick jobs (with MapReduce only). Not
as fast as well-architected, hand-written code.
EoU: Very concise and easy to use. Comes with most of the same drawbacks
of dynamic programming languages. Not straightforward to test.
E: Requires user intervention to create/modify tables with custom
configuration and splits. Per-cell column visibility is poorly represented
because Pig Latin cannot express it well.
Apache Hive
Apache Hive is data warehouse software that facilitates
querying and managing large datasets residing in
distributed storage.[1]
One of the “old-time” SQL-on-Hadoop software projects.
Has recently fought hard against the "batch-only" stigma by building on
top of Tez for "interactive" queries.
Defines Hive Query Language (HQL) which is close to, but not quite,
compatible with the SQL-92 standard.
Defines extension points which allow for external storage engines
known as StorageHandlers.
1. http://hive.apache.org
Apache Hive
# Create a Hive table from the Accumulo table “my_table”
> CREATE TABLE my_table(uid string, name string, age int, height int)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" =
":rowID,person:name,person:age,person:height"
);
# Run “SQL” queries
> SELECT name, height, uid FROM my_table ORDER BY height;
Like Pig, simple queries can be executed with very little code, and each
record maps into an Accumulo row.
Unlike Pig, generating these tables in Hive itself is often difficult,
relying on first creating a "native" Hive table and then inserting the data
into an AccumuloStorageHandler-backed Hive table.
AccumuloStorageHandler introduced in Apache Hive 0.14.0. With the
use of Tez, “point” queries on the rowID can be executed extremely quickly:
> SELECT * FROM my_table WHERE uid = '12345';
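The mapping string is positional: the Nth Hive column maps to the Nth Accumulo column, with ":rowID" marking the Hive column that carries the row ID. A simplified, hypothetical sketch of that pairing (not the actual AccumuloStorageHandler code):

```python
# Hive columns from the CREATE TABLE statement, in declaration order
hive_columns = ["uid", "name", "age", "height"]
# The accumulo.columns.mapping serde property from the example above
mapping = ":rowID,person:name,person:age,person:height"

# Pair each Hive column with its Accumulo family:qualifier (or :rowID)
pairs = dict(zip(hive_columns, mapping.split(",")))
print(pairs["uid"])   # :rowID
print(pairs["name"])  # person:name
```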
Apache Hive
Using SQL to query Accumulo is a refreshing change, but
the write-path with Hive leaves a bit to be desired. Will
often require data ingest through another tool.
M: Apache Hive is a very well-defined community with its own processes.
S: HQL sometimes feels a bit clunky due to limitations of the
StorageHandler interface.
P: Lots of effort here in Hive recently using Apache Calcite and Apache Tez to
optimize query execution and reduce MapReduce overhead. Translating
Accumulo Key-Values to Hive’s types can be expensive as well.
EoU: HQL as it stands is close enough to make those familiar with SQL feel at
home. Some oddities to work around, but they are typically easy to deal with.
E: Like Pig, Hive also suffers from the lack of an ability to represent features
like cell-level visibility. Some options like table configuration, are exposed
through Hive, but most cases will require custom manipulation and
configuration of Accumulo tables before using Hive.
Accumulo “Thrift Proxy”
Apache Thrift is a software framework which combines a
software stack with a code generation engine to build
cross-language services.[1]
Thrift is the software that Accumulo builds its client-server RPC
service on.
Thrift provides desirable features such as optional message fields and
well-performing abstractions over the low-level details such as
threading and connection management.
Clients and servers don’t need to be implemented in the same
language as each other.
1. http://thrift.apache.org
Accumulo “Thrift Proxy”
Clients could implement the necessary code to speak
directly to the Accumulo Master and TabletServers, but
that is an extremely large undertaking.
Accumulo provides an optional “Proxy” process which
provides a Java API-like interface over Thrift instead of
the low-level RPC Thrift API.
Accumulo bundles Python and Ruby client bindings by
default. Generating other languages is simple when Thrift
is already installed.
Accumulo "Thrift Proxy"
Ruby:
proxy.createTable(login, table, true, Accumulo::TimeType::MILLIS) unless proxy.tableExists(login, table)
update1 = Accumulo::ColumnUpdate.new({'colFamily' => "cf1", 'colQualifier' => "cq1", 'value' => "a"})
update2 = Accumulo::ColumnUpdate.new({'colFamily' => "cf2", 'colQualifier' => "cq2", 'value' => "b"})
proxy.updateAndFlush(login, table, {'row1' => [update1, update2]})
cookie = proxy.createScanner(login, table, nil)
result = proxy.nextK(cookie, 10)
result.results.each { |keyvalue| puts "Key: #{keyvalue.key.inspect} Value: #{keyvalue.value}" }

Python:
if not client.tableExists(login, table):
    client.createTable(login, table, True, TimeType.MILLIS)
row1 = {'a': [ColumnUpdate('a', 'a', value='value1'),
              ColumnUpdate('b', 'b', value='value2')]}
client.updateAndFlush(login, table, row1)
cookie = client.createScanner(login, table, None)
for entry in client.nextK(cookie, 10).results:
    print entry
Accumulo “Thrift Proxy”
The first noticeable difference between implementations is
that a Python or Ruby client will perform much worse than
a native Java client.
Some of the performance loss is likely in using a dynamic
language. Your experience in the language is relevant too.
Most of the performance loss is due to passing all requests
through the Proxy before it reaches TabletServers.
Proxy servers are not highly available and would require
manual load balancing. Single client environments work
well, but many active clients will overload a Proxy.
Accumulo “Thrift Proxy”
The novelty of using languages like Python and Ruby to
interact with Accumulo is enjoyable. The Proxy’s
architecture will not scale well past a few clients.
M: The Proxy isn’t widely (publicly) used but is generally maintained by devs.
S: Because the Proxy server API isn't in the Accumulo Public API, no
guarantees are made about its methods.
P: High availability and load balancing are left to users to solve. Will take
significant engineering effort to smartly scale to supporting many clients.
EoU: Thrift tends to generate decent code to work with for each supported
language which makes writing clients feel relatively natural.
E: The generated client code for each language could easily be extended to
act more like an ORM. The full spectrum of Accumulo's Java API is exposed
via the Proxy, which imposes no limitations on its use.
Cascading and Spark
Apache Spark has been causing big waves in the Hadoop
community for the past year, touted as everything from a
complete replacement for MapReduce to a complementary
technology.
Cascading (not at the ASF but is ASLv2) is an abstraction
layer on top of various Hadoop components. It’s been
around for quite some time now and is well-received.
Both suffer from a lack of well-defined upstream Accumulo
adoption within their respective communities. Snippets
can be found online, but they’re typically end-user
developed additions.
Lots of opportunities for users to step up and improve each!
Clojure and Scala
Clojure and Scala are both examples of languages which
run on the JVM that are not Java.
These languages can both use the Accumulo Java API
natively, although it's somewhat uncharted territory that
may have subtle bugs (ACCUMULO-3718).
GitHub has a smattering of example code, but definitive
resources are lacking for both Clojure and Scala.
Lots of opportunity for users to step up and improve
support for these languages!
Concrete Comparison
Let’s do a comparison on the effort needed to analyze some real data.
Stanford hosts a collection of Amazon reviews (~35M records, ~14GB gzipped)
that are available for use.[1] Reviews retain their category from Amazon (e.g.
Books, Music, Instant Video) as well as some metadata such as the user who
made the review, the score, and the review text. Review scores are integer
values between 1 and 5, inclusive.
The steps taken were as follows:
1. Convert the raw files into CSV (custom Java code)
2. Insert the data into an Accumulo table (custom Java code)
3. Answer a query using the Accumulo Java API, Pig and Hive.
The question is relatively simple and (hopefully) representative of a practical
problem to solve: compute the average review score on books by each identified
user. If I made two book reviews with scores 1 and 5, the query would return
3 for me, as (1 + 5) / 2 = 3.
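The worked example can be checked with a few lines of Python (illustrative only, with made-up reviews; the actual experiments ran against Accumulo):

```python
from collections import defaultdict

# Made-up (user, category, score) reviews
reviews = [
    ("me", "book", 1),
    ("me", "book", 5),
    ("me", "music", 4),        # ignored: not a book review
    ("someone_else", "book", 4),
]

# Filter to books, group scores by user, then average
scores = defaultdict(list)
for user, category, score in reviews:
    if category == "book":
        scores[user].append(score)

averages = {user: sum(s) / len(s) for user, s in scores.items()}
print(averages["me"])  # 3.0, i.e. (1 + 5) / 2
```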
1. http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics:
understanding rating dimensions with review text. RecSys, 2013.
Concrete Comparison
To answer the question, we need to scan Accumulo, apply two filters, group
reviews for the same user together and compute an average. I wrote a simple
parser and ingester in ~750 lines of Java (leveraging some libraries).
Accumulo Java API:
A single-threaded client which performs all of this in memory can be
achieved in 162 lines of code. Doesn’t use any custom iterators. Not a
MapReduce job so the grouping phase must fit in memory. More work is
needed to actually scale this solution.
Pig:
1 line of Pig Latin to define the relation (table), 4 lines which perform the
computations and 1 line to output the data to the console.
Hive:
1 line to register our Accumulo table as a Hive table, and 1 HQL statement.
Both Pig and Hive also have the ability to run as a MapReduce job which
means that they can handle much larger datasets automatically.
Takeaways
Take stock of your application needs and run your own
experiments!
Every approach has its pros and cons, with the Accumulo
Java API really only suffering from the verbosity and
boilerplate of Java applications themselves.
Because each application is different, it’s important to take
stock of which problems need to be solved, which can be
“hacked”, and which can be completely ignored.
Whatever you do choose, make an effort to contribute
back to the community in some way!
Credit where credit is due
Amazon Reviews: http://snap.stanford.edu/data/web-Amazon-links.html: J.
McAuley and J. Leskovec. Hidden factors and hidden topics: understanding
rating dimensions with review text. RecSys, 2013.
Other code used for the experiments:
● Parser, ingester, and query code: https://github.com/joshelser/as2015
● Library to help ingest the data: https://github.com/joshelser/cereal
Names (Apache, Apache $Project, and $Project) and logos are trademarks of the
ASF and the respective Apache $Projects: Accumulo, Hive, Pig, Spark, and Thrift.
The Cascading logo used was from http://www.cascading.org/
The Clojure logo used was from http://clojure.org/
The Scala logo used was copied from http://www.scala-lang.org/
Thanks!
@josh_elser
joshelser@gmail.com
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 

Recently uploaded (20)

Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 

Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API

  • 1. © Josh Elser 2015, Hortonworks Alternatives to Apache Accumulo’s Java API Josh Elser @josh_elser (@hortonworks)
  • 2. © Josh Elser 2015, Hortonworks Or… I’m really tired of having to write Java code all the time and I want to use something else.
  • 3. © Josh Elser 2015, Hortonworks Or… OK, I’ll still write Java, but I’m 110% done with re-writing the same boilerplate to parse CLI args, convert records into a standard format, deal with concurrency and retry server-side errors...
  • 4. © Josh Elser 2015, Hortonworks You have options There is life after Accumulo’s Java API: ● Apache Pig ● Apache Hive ● Accumulo’s “Thrift Proxy” ● Other JVM-based languages ● Cascading/Scalding ● Spark
  • 5. © Josh Elser 2015, Hortonworks Lots of integration points, lots of considerations: We want to avoid numbering each consideration because each differs in importance depending on the application. Every decision has an effect Maturity Stability Performance Extensibility Ease of use
  • 6. © Josh Elser 2015, Hortonworks Maturity How well-adopted is the code you’re using? Where does the code live? Is there a structured community or is it just sitting in a GitHub repository? Can anyone add fixes and improvements? Are they merged/accepted (when someone provides them)? Are there tests and are they actually run? Are releases made and published regularly? Your own code is difficult enough to maintain.
  • 7. © Josh Elser 2015, Hortonworks Stability Is there a well-defined user-facing API to use? Cross-project integrations are notorious for making assumptions about how you should use the code. Does the integration produce the same outcomes that the “native” components do? Can users reliably expect code to work across versions? Using some external integration should feel like using the project without that integration. Code that worked once should continue to work.
  • 8. © Josh Elser 2015, Hortonworks Performance Does the code run quickly enough? Can you saturate your physical resources with ease? Do you have to spend days in a performance tool reworking how you use the API? Does the framework spend excessive amounts of time converting types to its own representations? Can you get an answer in an acceptable amount of time? Each use case has its own set of performance requirements. Experimentation is necessary.
  • 9. © Josh Elser 2015, Hortonworks Ease of Use Can you write the necessary code in a reasonable amount of time? Goes back to: “am I sick of writing verbose code (Java)”? Choosing the right tool can drastically reduce the amount of code to write. Can the solution to your problem be reasonably expressed in the required language? Using a library should feel natural and enjoyable to write while producing a succinct solution.
  • 10. © Josh Elser 2015, Hortonworks Extensibility Does the integration support enough features of the underlying system? Can you use the novel features of the underlying system via the integration? Can custom parsing/processing logic be included? How much external configuration/setup is needed before you can invoke your code? Using an integration should not require sacrificing features of the underlying software.
  • 11. © Josh Elser 2015, Hortonworks Apply it to Accumulo! Let’s take these 5 points and see how they apply to some of the more well-defined integration projects. We’ll use Accumulo’s Java API as the reference point for how we judge other projects.
  • 12. © Josh Elser 2015, Hortonworks Accumulo Java API Reference implementation on how clients use Accumulo. Composed of Java methods and classes, each heavily scrutinized for its value and effectiveness. M: Evaluated/implemented by all Accumulo developers. Well-tested and heavily critiqued. S: Follows SemVer since 1.6.1. Is the definition of the Accumulo API. P: High-performance, typically limited by network and server-side impl. EoU: Verbose and often pedantic. Implements low-level, key-value-centric operations, not high-level application functions. E: Provides well-defined building blocks for implementing custom libraries and exposes methods for interacting with all Accumulo features.
  • 13. © Josh Elser 2015, Hortonworks Apache Pig Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.[1] Default execution runs on YARN (MapReduce and Tez). Pig is often adored for its fast prototyping and data analysis abilities with “Pig Latin”: functions which perform operations on Tuples. Pig Latin allows for very concise solutions to problems. The LoadFunc/StoreFunc interfaces enable AccumuloStorage 1. http://pig.apache.org
  • 14. © Josh Elser 2015, Hortonworks Apache Pig -- Load a text file of data A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float); -- Group records by student B = GROUP A BY name; -- Average GPAs per student C = FOREACH B GENERATE group, AVG(A.gpa); 3 lines of Pig Latin, would take hundreds of lines in Java just to read the data. AccumuloStorage introduced in Apache Pig 0.13.0 Maps each tuple into an Accumulo row. Very easy to both write/read data to/from Accumulo. STORE flights INTO 'accumulo://flights?instance=...' USING org.apache.pig.backend.hadoop.accumulo.AccumuloStorage( 'carrier_name,src_airport,dest_airport,tail_number');
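For a sense of what those three lines of Pig Latin replace, here is the same group-and-average written out in plain Python. This is a rough sketch only: the tab-delimited layout and the (name, term, gpa) field order are assumptions based on the slide, not the actual dataset.

```python
import csv
from collections import defaultdict

def average_gpas(path):
    """Group student records by name and average their GPAs,
    mirroring the three-line Pig Latin script above.
    Assumes a tab-delimited file of (name, term, gpa) rows."""
    sums = defaultdict(lambda: [0.0, 0])  # name -> [gpa total, record count]
    with open(path) as f:
        for name, term, gpa in csv.reader(f, delimiter='\t'):
            sums[name][0] += float(gpa)
            sums[name][1] += 1
    return {name: total / count for name, (total, count) in sums.items()}
```

The point isn’t that Python is as verbose as Java; it’s that any general-purpose language makes you spell out the reading, grouping, and aggregation that Pig Latin expresses declaratively.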
  • 15. © Josh Elser 2015, Hortonworks Apache Pig Pig enables users to perform lots of powerful data manipulation and computation tasks with little code, but requires users to learn Pig Latin, which is unique to Pig. M: Apache Pig is a very well-defined community with its own processes. S: Use of Pig Latin with AccumuloStorage feels natural and doesn’t have edge cases which are unsupported. P: Often suffers from the under-optimization that comes with generalized MapReduce. Will incur penalties for quick jobs (with MapReduce only). Not as fast as well-architected, hand-written code. EoU: Very concise and easy to use. Comes with most of the same drawbacks of dynamic programming languages. Not straightforward to test. E: Requires user intervention to create/modify tables with custom configuration and splits. Column visibility on a per-cell basis is poorly represented because Pig Latin has no natural way to express it.
  • 16. © Josh Elser 2015, Hortonworks Apache Hive Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.[1] One of the “old-time” SQL-on-Hadoop software projects. Has fought hard against the “batch-only” stigma recently, building on top of Tez for “interactive” queries. Defines Hive Query Language (HQL), which is close to, but not quite, compatible with the SQL-92 standard. Defines extension points which allow for external storage engines known as StorageHandlers. 1. http://hive.apache.org
  • 17. © Josh Elser 2015, Hortonworks Apache Hive # Create a Hive table from the Accumulo table “my_table” > CREATE TABLE my_table(uid string, name string, age int, height int) STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler' WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,person:name,person:age,person:height"); # Run “SQL” queries > SELECT name, height, uid FROM my_table ORDER BY height; Like Pig, simple queries can be executed with very little code and each record maps into an Accumulo row. Unlike Pig, generating these tables in Hive itself is often difficult and relies on first creating a “native” Hive table and then inserting the data into an AccumuloStorageHandler-backed Hive table. AccumuloStorageHandler introduced in Apache Hive 0.14.0. With the use of Tez, “point” queries on the rowID can be executed extremely quickly: > SELECT * FROM my_table WHERE uid = '12345';
  • 18. © Josh Elser 2015, Hortonworks Apache Hive Using SQL to query Accumulo is a refreshing change, but the write-path with Hive leaves a bit to be desired. Will often require data ingest through another tool. M: Apache Hive is a very well-defined community with its own processes. S: HQL sometimes feels a bit clunky due to limitations of the StorageHandler interface. P: Lots of effort here in Hive recently using Apache Calcite and Apache Tez to optimize query execution and reduce MapReduce overhead. Translating Accumulo Key-Values to Hive’s types can be expensive as well. EoU: HQL as it stands is close enough to make those familiar with SQL feel at home. Some oddities to work around, but they are typically easy to deal with. E: Like Pig, Hive also suffers from the lack of an ability to represent features like cell-level visibility. Some options, like table configuration, are exposed through Hive, but most cases will require custom manipulation and configuration of Accumulo tables before using Hive.
  • 19. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” Apache Thrift is a software framework which combines a software stack with a code generation engine to build cross-language services.[1] Thrift is the software that Accumulo builds its client-server RPC service on. Thrift provides desirable features such as optional message fields and well-performing abstractions over low-level details such as threading and connection management. Clients and servers don’t need to be implemented in the same language as each other. 1. http://thrift.apache.org
  • 20. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” Clients could directly implement the necessary code to speak directly to the Accumulo Master and TabletServers, but that is an extremely large undertaking. Accumulo provides an optional “Proxy” process which provides a Java API-like interface over Thrift instead of the low-level RPC Thrift API. Accumulo bundles Python and Ruby client bindings by default. Generating other languages is simple when Thrift is already installed.
  • 21. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” Ruby: proxy.createTable(login, table, true, Accumulo::TimeType::MILLIS) unless proxy.tableExists(login, table) update1 = Accumulo::ColumnUpdate.new({'colFamily' => "cf1", 'colQualifier' => "cq1", 'value' => "a"}) update2 = Accumulo::ColumnUpdate.new({'colFamily' => "cf2", 'colQualifier' => "cq2", 'value' => "b"}) proxy.updateAndFlush(login, table, {'row1' => [update1, update2]}) cookie = proxy.createScanner(login, table, nil) result = proxy.nextK(cookie, 10) result.results.each { |keyvalue| puts "Key: #{keyvalue.key.inspect} Value: #{keyvalue.value}" } Python: if not client.tableExists(login, table): client.createTable(login, table, True, TimeType.MILLIS) row1 = {'a': [ColumnUpdate('a', 'a', value='value1'), ColumnUpdate('b', 'b', value='value2')]} client.updateAndFlush(login, table, row1) cookie = client.createScanner(login, table, None) for entry in client.nextK(cookie, 10).results: print entry
  • 22. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” The first noticeable difference is that a Python or Ruby client will perform much worse than a native Java client. Some of the performance loss is likely in using a dynamic language. Your experience in the language is relevant too. Most of the performance loss is due to passing all requests through the Proxy before they reach the TabletServers. Proxy servers are not highly available and would require manual load balancing. Single-client environments work well, but many active clients will overload a Proxy.
  • 23. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” The novelty of using languages like Python and Ruby to interact with Accumulo is enjoyable. The Proxy’s architecture will not scale well past a few clients. M: The Proxy isn’t widely (publicly) used but is generally maintained by devs. S: Because the Proxy server API isn’t in the Accumulo Public API, no guarantees are made on its methods. P: High availability and load balancing are left to users to solve. Will take significant engineering effort to smartly scale to supporting many clients. EoU: Thrift tends to generate decent code to work with for each supported language, which makes writing clients feel relatively natural. E: The generated client code per language could easily be extended to act more like an ORM. The full spectrum of Accumulo’s Java API is exposed via the Proxy, which doesn’t impose limitations on use.
  • 24. © Josh Elser 2015, Hortonworks Cascading and Spark Apache Spark has been causing big waves in the Hadoop community for the past year, touted as everything from a complete replacement for MapReduce to a complementary technology. Cascading (not at the ASF, but ASLv2-licensed) is an abstraction layer on top of various Hadoop components. It’s been around for quite some time now and is well-received. Both suffer from a lack of well-defined upstream Accumulo adoption within their respective communities. Snippets can be found online, but they’re typically end-user developed additions. Lots of opportunities for users to step up and improve each!
  • 25. © Josh Elser 2015, Hortonworks Clojure and Scala Clojure and Scala are both examples of languages which run on the JVM that are not Java. These languages should both natively support the Accumulo Java API, although it’s somewhat uncharted territory that may have subtle bugs (ACCUMULO-3718) GitHub has a smattering of example code, but definitive resources are lacking for both Clojure and Scala. Lots of opportunity for users to step up and improve support for these languages!
  • 26. © Josh Elser 2015, Hortonworks Concrete Comparison Let’s do a comparison on the effort needed to analyze some real data. Stanford hosts a collection of Amazon reviews (~35M records, ~14G gzip) that are available for use.[1] Reviews retain their category from Amazon (e.g. Books, Music, Instant Video) as well as some metadata such as the user who made the review, the score and the review text. Review scores are integer values between 1 and 5 inclusive. The steps taken were as follows: 1. Convert the raw files into CSV (custom Java code) 2. Insert the data into an Accumulo table (custom Java code) 3. Answer a query using the Accumulo Java API, Pig and Hive. The question is relatively simple and (hopefully) representative of a practical problem to solve: compute the average review on books by each identified user. If I made two book reviews with scores 1 and 5, the query would return a value of 3 for me as (1 + 5) / 2 = 3. 1. http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
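Stripped of any Accumulo specifics, the query in step 3 is just a filter, a group-by, and an average. A minimal in-memory sketch of that logic in Python, with the (user, category, score) record shape assumed purely for illustration:

```python
from collections import defaultdict

def average_book_reviews(reviews):
    """Filter to Books-category reviews, group by user, and average
    the scores. `reviews` is an iterable of (user, category, score)
    tuples -- a hypothetical record shape, for illustration only."""
    totals = defaultdict(lambda: [0, 0])  # user -> [score sum, review count]
    for user, category, score in reviews:
        if category == 'Books':
            totals[user][0] += score
            totals[user][1] += 1
    return {user: s / c for user, (s, c) in totals.items()}
```

This is the same shape of computation each of the three tools expresses: the Java API spells it out imperatively, while Pig and Hive get it down to a handful of declarative lines.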
  • 27. © Josh Elser 2015, Hortonworks Concrete Comparison To answer the question, we need to scan Accumulo, apply two filters, group reviews for the same user together and compute an average. I wrote a simple parser and ingester in ~750 lines of Java (leveraging some libraries). Accumulo Java API: A single-threaded client which performs all of this in memory can be achieved in 162 lines of code. Doesn’t use any custom iterators. Not a MapReduce job so the grouping phase must fit in memory. More work is needed to actually scale this solution. Pig: 1 line of Pig Latin to define the relation (table), 4 lines which perform the computations and 1 line to output the data to the console. Hive: 1 line to register our Accumulo table as a Hive table, and 1 HQL statement. Both Pig and Hive also have the ability to run as a MapReduce job which means that they can handle much larger datasets automatically.
  • 28. © Josh Elser 2015, Hortonworks Takeaways Take stock of your application needs and run your own experiments! Every approach has its pros and cons, with the Accumulo Java API really only suffering from the verbosity and boilerplate of Java applications themselves. Because each application is different, it’s important to take stock of which problems need to be solved, which can be “hacked”, and which can be completely ignored. Whatever you do choose, make an effort to contribute back to the community in some way!
  • 29. © Josh Elser 2015, Hortonworks Credit where credit is due Amazon Reviews: http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013. Other code used for the experiments: ● Parser, ingester, and query code: https://github.com/joshelser/as2015 ● Library to help ingest the data: https://github.com/joshelser/cereal Names (Apache, Apache $Project, and $Project) and logos are trademarks of the ASF and the respective Apache $Projects: Accumulo, Hive, Pig, Spark, and Thrift. The Cascading logo used was from http://www.cascading.org/ The Clojure logo used was from http://clojure.org/ The Scala logo used was copied from http://www.scala-lang.org/
  • 30. © Josh Elser 2015, Hortonworks Thanks! @josh_elser joshelser@gmail.com