© Josh Elser 2015, Hortonworks
Alternatives to Apache Accumulo’s Java API
Josh Elser
@josh_elser (@hortonworks)
Or…
I’m really tired of having to write Java
code all the time and I want to use
something else.
Or…
OK, I’ll still write Java, but I’m 110%
done with re-writing the same
boilerplate to parse CLI args, convert
records into a standard format, deal
with concurrency and retry server-side errors...
You have options
There is life after Accumulo’s Java API:
● Apache Pig
● Apache Hive
● Accumulo’s “Thrift Proxy”
● Other JVM-based languages
● Cascading/Scalding
● Spark
Lots of integration points, lots of considerations.
We avoid numbering the considerations because each
differs in importance depending on the application.
Every decision has an effect
Maturity
Stability
Performance
Extensibility
Ease of use
Maturity
How well-adopted is the code you’re using?
Where does the code live? Is there a structured
community or is it just sitting in a Github repository?
Can anyone add fixes and improvements? Are they
merged/accepted (when someone provides them)?
Are there tests and are they actually run?
Are releases made and published regularly?
Your own code is difficult enough to maintain.
Stability
Is there a well-defined user-facing API to use?
Cross-project integrations are notorious for making
assumptions about how you should use the code.
Does the integration produce the same outcomes that the
“native” components do?
Can users reliably expect code to work across versions?
Using some external integration should feel like using the
project without that integration. Code that worked once
should continue to work.
Performance
Does the code run sufficiently fast?
Can you saturate your physical resources with ease?
Do you have to spend days in a performance tool
reworking how you use the API?
Does the framework spend excessive amounts of time
converting types to its own internal representation?
Can you get an answer in an acceptable amount of time?
Each use case has its own set of performance
requirements. Experimentation is necessary.
Ease of Use
Can you write the necessary code in a reasonable
amount of time?
Goes back to: “am I sick of writing verbose code (Java)”?
Choosing the right tool can drastically reduce the amount
of code to write.
Can the solution to your problem be reasonably expressed
in the required language?
Using a library should feel natural and enjoyable to write
while producing a succinct solution.
Extensibility
Does the integration support enough features of the
underlying system?
Can you use the novel features of the underlying system
via the integration?
Can custom parsing/processing logic be included?
How much external configuration/setup is needed before
you can invoke your code?
Using an integration should not require sacrifice in the
features of the underlying software.
Apply it to Accumulo!
Let’s take these 5 points and see how they apply to some of
the more well-defined integration projects.
We’ll use Accumulo’s Java API as the reference point for
how we judge other projects.
Accumulo Java API
Reference implementation on how clients use Accumulo.
Composed of Java methods and classes, each rigorously
assessed for its value and effectiveness.
M: Evaluated/implemented by all Accumulo developers. Well-tested and
heavily critiqued.
S: Follows SemVer since 1.6.1. Is the definition of the Accumulo API.
P: High-performance, typically limited by network and server-side impl.
EoU: Verbose and often pedantic. Implements low-level, key-value-centric
operations, not high-level application functions.
E: Provides well-defined building blocks for implementing custom libraries
and exposes methods for interacting with all Accumulo features.
Apache Pig
Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data
analysis programs.[1]
Default execution runs on YARN (MapReduce and Tez).
Pig is often adored for its fast prototyping and data analysis abilities
with “Pig Latin”: functions which perform operations on Tuples.
Pig Latin allows for very concise solutions to problems.
Pig's LoadFunc/StoreFunc extension points enable AccumuloStorage
1. http://pig.apache.org
Apache Pig
-- Load a text file of data
A = LOAD 'student.txt' AS ( name:chararray, term:chararray, gpa:float);
-- Group records by student
B = GROUP A BY name;
-- Average GPAs per student
C = FOREACH B GENERATE group, AVG(A.gpa);
3 lines of Pig Latin that would take hundreds of lines of Java just to read
the data.
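The same aggregation can be sketched in plain Python (an illustrative stand-in for what the Pig job computes, using made-up records):

```python
from collections import defaultdict

# Made-up (name, term, gpa) records, mirroring student.txt
records = [
    ("alice", "fall", 3.0),
    ("alice", "spring", 4.0),
    ("bob", "fall", 2.0),
]

# B = GROUP A BY name;
grouped = defaultdict(list)
for name, term, gpa in records:
    grouped[name].append(gpa)

# C = FOREACH B GENERATE group, AVG(A.gpa);
averages = {name: sum(gpas) / len(gpas) for name, gpas in grouped.items()}
print(averages)  # {'alice': 3.5, 'bob': 2.0}
```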
AccumuloStorage introduced in Apache Pig 0.13.0
Maps each tuple into an Accumulo row.
Very easy to both write/read data to/from Accumulo.
STORE flights INTO 'accumulo://flights?instance=...' USING
org.apache.pig.backend.hadoop.accumulo.AccumuloStorage(
'carrier_name,src_airport,dest_airport,tail_number');
Apache Pig
Pig enables users to perform lots of powerful data
manipulation and computation tasks with little code, but
requires users to learn Pig Latin, a language unique to Pig.
M: Apache Pig is a very well-defined community with its own processes.
S: Use of Pig Latin with AccumuloStorage feels natural and doesn’t have
edge cases which are unsupported.
P: Often suffers from the under-optimization that comes with generalized
MapReduce. Will incur penalties for quick jobs (with MapReduce only). Not
as fast as well-architected, hand-written code.
EoU: Very concise and easy to use. Comes with most of the same drawbacks
of dynamic programming languages. Not straightforward to test.
E: Requires user intervention to create/modify tables with custom
configuration and splits. Per-cell column visibility is poorly represented
because Pig Latin cannot express it well.
Apache Hive
Apache Hive is data warehouse software that facilitates
querying and managing large datasets residing in
distributed storage.[1]
One of the “old-time” SQL-on-Hadoop software projects.
Has recently fought hard against the "batch-only" stigma by building on
top of Tez for "interactive" queries.
Defines Hive Query Language (HQL) which is close to, but not quite,
compatible with the SQL-92 standard.
Defines extension points which allow for external storage engines
known as StorageHandlers.
1. http://hive.apache.org
Apache Hive
# Create a Hive table from the Accumulo table “my_table”
> CREATE TABLE my_table(uid string, name string, age int, height int)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" =
":rowID,person:name,person:age,person:height"
);
# Run “SQL” queries
> SELECT name, height, uid FROM my_table ORDER BY height;
Like Pig, simple queries can be executed with very little code, and each
record maps into an Accumulo row.
Unlike Pig, generating these tables in Hive itself is often difficult,
relying on first creating a "native" Hive table and then inserting the data
into an AccumuloStorageHandler-backed Hive table.
AccumuloStorageHandler introduced in Apache Hive 0.14.0. With the
use of Tez, “point” queries on the rowID can be executed extremely quickly:
> SELECT * FROM my_table WHERE uid = '12345';
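The mapping string is positional: the Nth Hive column maps to the Nth Accumulo column, with ":rowID" marking the Hive column that carries the row ID. A simplified, hypothetical sketch of that pairing (not the actual AccumuloStorageHandler code):

```python
# Hive columns from the CREATE TABLE statement, in declaration order
hive_columns = ["uid", "name", "age", "height"]
# The accumulo.columns.mapping serde property from the example above
mapping = ":rowID,person:name,person:age,person:height"

# Pair each Hive column with its Accumulo family:qualifier (or :rowID)
pairs = dict(zip(hive_columns, mapping.split(",")))
print(pairs["uid"])   # :rowID
print(pairs["name"])  # person:name
```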
Apache Hive
Using SQL to query Accumulo is a refreshing change, but
the write-path with Hive leaves a bit to be desired. Will
often require data ingest through another tool.
M: Apache Hive is a very well-defined community with its own processes.
S: HQL sometimes feels a bit clunky due to limitations of the
StorageHandler interface.
P: Lots of effort here in Hive recently using Apache Calcite and Apache Tez to
optimize query execution and reduce MapReduce overhead. Translating
Accumulo Key-Values to Hive’s types can be expensive as well.
EoU: HQL as it stands is close enough to make those familiar with SQL feel at
home. Some oddities to work around, but they are typically easy to deal with.
E: Like Pig, Hive also suffers from the lack of an ability to represent features
like cell-level visibility. Some options like table configuration, are exposed
through Hive, but most cases will require custom manipulation and
configuration of Accumulo tables before using Hive.
Accumulo “Thrift Proxy”
Apache Thrift is a software framework which combines a
software stack with a code generation engine to build
cross-language services.[1]
Thrift is the software that Accumulo builds its client-server RPC
service on.
Thrift provides desirable features such as optional message fields and
well-performing abstractions over the low-level details such as
threading and connection management.
Clients and servers don’t need to be implemented in the same
language as each other.
1. http://thrift.apache.org
Accumulo “Thrift Proxy”
Clients could implement the necessary code to speak
directly to the Accumulo Master and TabletServers, but
that is an extremely large undertaking.
Accumulo provides an optional “Proxy” process which
provides a Java API-like interface over Thrift instead of
the low-level RPC Thrift API.
Accumulo bundles Python and Ruby client bindings by
default. Generating other languages is simple when Thrift
is already installed.
Accumulo "Thrift Proxy"
Ruby:
proxy.createTable(login, table, true, Accumulo::TimeType::MILLIS) unless proxy.tableExists(login, table)
update1 = Accumulo::ColumnUpdate.new({'colFamily' => "cf1", 'colQualifier' => "cq1", 'value' => "a"})
update2 = Accumulo::ColumnUpdate.new({'colFamily' => "cf2", 'colQualifier' => "cq2", 'value' => "b"})
proxy.updateAndFlush(login, table, {'row1' => [update1, update2]})
cookie = proxy.createScanner(login, table, nil)
result = proxy.nextK(cookie, 10)
result.results.each { |keyvalue| puts "Key: #{keyvalue.key.inspect} Value: #{keyvalue.value}" }

Python:
if not client.tableExists(login, table):
    client.createTable(login, table, True, TimeType.MILLIS)
row1 = {'a': [ColumnUpdate('a', 'a', value='value1'),
              ColumnUpdate('b', 'b', value='value2')]}
client.updateAndFlush(login, table, row1)
cookie = client.createScanner(login, table, None)
for entry in client.nextK(cookie, 10).results:
    print entry
Accumulo “Thrift Proxy”
The first noticeable difference between implementations is
that a Python or Ruby client will perform much worse than
a native Java client.
Some of the performance loss is likely in using a dynamic
language. Your experience in the language is relevant too.
Most of the performance loss is due to passing all requests
through the Proxy before it reaches TabletServers.
Proxy servers are not highly available and would require
manual load balancing. Single client environments work
well, but many active clients will overload a Proxy.
Accumulo “Thrift Proxy”
The novelty of using languages like Python and Ruby to
interact with Accumulo is enjoyable. The Proxy’s
architecture will not scale well past a few clients.
M: The Proxy isn’t widely (publicly) used but is generally maintained by devs.
S: Because the Proxy server API isn't in the Accumulo Public API, no
guarantees are made about its methods.
P: High availability and load balancing are left to users to solve. Will take
significant engineering effort to smartly scale to supporting many clients.
EoU: Thrift tends to generate decent code to work with for each supported
language which makes writing clients feel relatively natural.
E: The generated client code for each language could easily be extended to
act more like an ORM. The full spectrum of Accumulo's Java API is exposed
via the Proxy, which imposes no limitations on its use.
Cascading and Spark
Apache Spark has been causing big waves in the Hadoop
community for the past year, touted as everything from a
complete replacement for MapReduce to a complementary
technology.
Cascading (not at the ASF but is ASLv2) is an abstraction
layer on top of various Hadoop components. It’s been
around for quite some time now and is well-received.
Both suffer from a lack of well-defined upstream Accumulo
adoption within their respective communities. Snippets
can be found online, but they’re typically end-user
developed additions.
Lots of opportunities for users to step up and improve each!
Clojure and Scala
Clojure and Scala are both examples of languages which
run on the JVM that are not Java.
These languages can both use the Accumulo Java API
natively, although it's somewhat uncharted territory that
may have subtle bugs (ACCUMULO-3718).
GitHub has a smattering of example code, but definitive
resources are lacking for both Clojure and Scala.
Lots of opportunity for users to step up and improve
support for these languages!
Concrete Comparison
Let’s do a comparison on the effort needed to analyze some real data.
Stanford hosts a collection of Amazon reviews (~35M records, ~14GB gzipped)
that are available for use.[1] Reviews retain their category from Amazon (e.g.
Books, Music, Instant Video) as well as some metadata such as the user who
made the review, the score, and the review text. Review scores are integer
values between 1 and 5, inclusive.
The steps taken were as follows:
1. Convert the raw files into CSV (custom Java code)
2. Insert the data into an Accumulo table (custom Java code)
3. Answer a query using the Accumulo Java API, Pig and Hive.
The question is relatively simple and (hopefully) representative of a practical
problem to solve: compute the average review score on books by each identified
user. If I made two book reviews with scores 1 and 5, the query would return
3 for me, as (1 + 5) / 2 = 3.
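The worked example can be checked with a few lines of Python (illustrative only, with made-up reviews; the actual experiments ran against Accumulo):

```python
from collections import defaultdict

# Made-up (user, category, score) reviews
reviews = [
    ("me", "book", 1),
    ("me", "book", 5),
    ("me", "music", 4),        # ignored: not a book review
    ("someone_else", "book", 4),
]

# Filter to books, group scores by user, then average
scores = defaultdict(list)
for user, category, score in reviews:
    if category == "book":
        scores[user].append(score)

averages = {user: sum(s) / len(s) for user, s in scores.items()}
print(averages["me"])  # 3.0, i.e. (1 + 5) / 2
```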
1. http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics:
understanding rating dimensions with review text. RecSys, 2013.
Concrete Comparison
To answer the question, we need to scan Accumulo, apply two filters, group
reviews for the same user together and compute an average. I wrote a simple
parser and ingester in ~750 lines of Java (leveraging some libraries).
Accumulo Java API:
A single-threaded client which performs all of this in memory can be
achieved in 162 lines of code. Doesn’t use any custom iterators. Not a
MapReduce job so the grouping phase must fit in memory. More work is
needed to actually scale this solution.
Pig:
1 line of Pig Latin to define the relation (table), 4 lines which perform the
computations and 1 line to output the data to the console.
Hive:
1 line to register our Accumulo table as a Hive table, and 1 HQL statement.
Both Pig and Hive also have the ability to run as a MapReduce job which
means that they can handle much larger datasets automatically.
Takeaways
Take stock of your application needs and run your own
experiments!
Every approach has its pros and cons, with the Accumulo
Java API really only suffering from the verbosity and
boilerplate of Java applications themselves.
Because each application is different, it’s important to take
stock of which problems need to be solved, which can be
“hacked”, and which can be completely ignored.
Whatever you do choose, make an effort to contribute
back to the community in some way!
Credit where credit is due
Amazon Reviews: http://snap.stanford.edu/data/web-Amazon-links.html: J.
McAuley and J. Leskovec. Hidden factors and hidden topics: understanding
rating dimensions with review text. RecSys, 2013.
Other code used for the experiments:
● Parser, ingester, and query code: https://github.com/joshelser/as2015
● Library to help ingest the data: https://github.com/joshelser/cereal
Names (Apache, Apache $Project, and $Project) and logos are trademarks of the
ASF and the respective Apache $Projects: Accumulo, Hive, Pig, Spark, and Thrift.
The Cascading logo used was from http://www.cascading.org/
The Clojure logo used was from http://clojure.org/
The Scala logo used was copied from http://www.scala-lang.org/
Thanks!
@josh_elser
joshelser@gmail.com
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 

Recently uploaded (20)

Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 

Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API

  • 1. © Josh Elser 2015, Hortonworks Alternatives to Apache Accumulo’s Java API Josh Elser @josh_elser (@hortonworks)
  • 2. © Josh Elser 2015, Hortonworks Or… I’m really tired of having to write Java code all the time and I want to use something else.
  • 3. © Josh Elser 2015, Hortonworks Or… OK, I’ll still write Java, but I’m 110% done with re-writing the same boilerplate to parse CLI args, convert records into a standard format, deal with concurrency and retry server-side errors...
  • 4. © Josh Elser 2015, Hortonworks You have options There is life after Accumulo’s Java API: ● Apache Pig ● Apache Hive ● Accumulo’s “Thrift Proxy” ● Other JVM-based languages ● Cascading/Scalding ● Spark
  • 5. © Josh Elser 2015, Hortonworks Lots of integration points, lots of considerations: We want to avoid numbering each consideration because each differs in importance depending on the application. Every decision has an effect Maturity Stability Performance Extensibility Ease of use
  • 6. © Josh Elser 2015, Hortonworks Maturity How well-adopted is the code you’re using? Where does the code live? Is there a structured community or is it just sitting in a GitHub repository? Can anyone add fixes and improvements? Are they merged/accepted (when someone provides them)? Are there tests and are they actually run? Are releases made and published regularly? Your own code is difficult enough to maintain.
  • 7. © Josh Elser 2015, Hortonworks Stability Is there a well-defined user-facing API to use? Cross-project integrations are notorious for making assumptions about how you should use the code. Does the integration produce the same outcomes that the “native” components do? Can users reliably expect code to work across versions? Using some external integration should feel like using the project without that integration. Code that worked once should continue to work.
  • 8. © Josh Elser 2015, Hortonworks Performance Does the code run quickly enough? Can you saturate your physical resources with ease? Do you have to spend days in a performance tool reworking how you use the API? Does the framework spend excessive amounts of time converting types to its own representations? Can you get an answer in an acceptable amount of time? Each use case has its own set of performance requirements. Experimentation is necessary.
  • 9. © Josh Elser 2015, Hortonworks Ease of Use Can you write the necessary code in a reasonable amount of time? Goes back to: “am I sick of writing verbose code (Java)”? Choosing the right tool can drastically reduce the amount of code to write. Can the solution to your problem be reasonably expressed in the required language? Using a library should feel natural and enjoyable to write while producing a succinct solution.
  • 10. © Josh Elser 2015, Hortonworks Extensibility Does the integration support enough features of the underlying system? Can you use the novel features of the underlying system via the integration? Can custom parsing/processing logic be included? How much external configuration/setup is needed before you can invoke your code? Using an integration should not require sacrificing features of the underlying software.
  • 11. © Josh Elser 2015, Hortonworks Apply it to Accumulo! Let’s take these 5 points and see how they apply to some of the more well-defined integration projects. We’ll use Accumulo’s Java API as the reference point for how we judge other projects.
  • 12. © Josh Elser 2015, Hortonworks Accumulo Java API Reference implementation on how clients use Accumulo. Composed of Java methods and classes, each heavily scrutinized for its value and effectiveness. M: Evaluated/implemented by all Accumulo developers. Well-tested and heavily critiqued. S: Follows SemVer since 1.6.1. Is the definition of the Accumulo API. P: High-performance, typically limited by network and server-side impl. EoU: Verbose and often pedantic. Implements low-level, key-value-centric operations, not high-level application functions. E: Provides well-defined building blocks for implementing custom libraries and exposes methods for interacting with all Accumulo features.
  • 13. © Josh Elser 2015, Hortonworks Apache Pig Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.[1] Default execution runs on YARN (MapReduce and Tez). Pig is often adored for its fast prototyping and data analysis abilities with “Pig Latin”: functions which perform operations on Tuples. Pig Latin allows for very concise solutions to problems. The LoadFunc/StoreFunc interfaces enable AccumuloStorage 1. http://pig.apache.org
  • 14. © Josh Elser 2015, Hortonworks Apache Pig -- Load a text file of data A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float); -- Group records by student B = GROUP A BY name; -- Average GPAs per student C = FOREACH B GENERATE group, AVG(A.gpa); 3 lines of Pig Latin, would take hundreds of lines in Java just to read the data. AccumuloStorage introduced in Apache Pig 0.13.0 Maps each tuple into an Accumulo row. Very easy to both write/read data to/from Accumulo. STORE flights INTO 'accumulo://flights?instance=...' USING org.apache.pig.backend.hadoop.accumulo.AccumuloStorage( 'carrier_name,src_airport,dest_airport,tail_number');
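For a sense of what those three lines of Pig Latin replace, here is the same group-and-average written out in plain Python. This is a rough sketch only: the tab-delimited layout and the (name, term, gpa) field order are assumptions based on the slide, not the actual dataset.

```python
import csv
from collections import defaultdict

def average_gpas(path):
    """Group student records by name and average their GPAs,
    mirroring the three-line Pig Latin script above.
    Assumes a tab-delimited file of (name, term, gpa) rows."""
    sums = defaultdict(lambda: [0.0, 0])  # name -> [gpa total, record count]
    with open(path) as f:
        for name, term, gpa in csv.reader(f, delimiter='\t'):
            sums[name][0] += float(gpa)
            sums[name][1] += 1
    return {name: total / count for name, (total, count) in sums.items()}
```

The point isn’t that Python is as verbose as Java; it’s that any general-purpose language makes you spell out the reading, grouping, and aggregation that Pig Latin expresses declaratively.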
  • 15. © Josh Elser 2015, Hortonworks Apache Pig Pig enables users to perform lots of powerful data manipulation and computation tasks with little code, but requires users to learn Pig Latin, which is unique to Pig. M: Apache Pig is a very well-defined community with its own processes. S: Use of Pig Latin with AccumuloStorage feels natural and doesn’t have edge cases which are unsupported. P: Often suffers from the under-optimization that comes with generalized MapReduce. Will incur penalties for quick jobs (with MapReduce only). Not as fast as well-architected, hand-written code. EoU: Very concise and easy to use. Comes with most of the same drawbacks of dynamic programming languages. Not straightforward to test. E: Requires user intervention to create/modify tables with custom configuration and splits. Column visibility on a per-cell basis is poorly represented because Pig Latin has no natural way to express it.
  • 16. © Josh Elser 2015, Hortonworks Apache Hive Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.[1] One of the “old-time” SQL-on-Hadoop software projects. Has fought hard against the “batch-only” stigma recently, building on top of Tez for “interactive” queries. Defines Hive Query Language (HQL), which is close to, but not quite, compatible with the SQL-92 standard. Defines extension points which allow for external storage engines known as StorageHandlers. 1. http://hive.apache.org
  • 17. © Josh Elser 2015, Hortonworks Apache Hive # Create a Hive table from the Accumulo table “my_table” > CREATE TABLE my_table(uid string, name string, age int, height int) STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler' WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,person:name,person:age,person:height"); # Run “SQL” queries > SELECT name, height, uid FROM my_table ORDER BY height; Like Pig, simple queries can be executed with very little code and each record maps into an Accumulo row. Unlike Pig, generating these tables in Hive itself is often difficult and relies on first creating a “native” Hive table and then inserting the data into an AccumuloStorageHandler-backed Hive table. AccumuloStorageHandler introduced in Apache Hive 0.14.0. With the use of Tez, “point” queries on the rowID can be executed extremely quickly: > SELECT * FROM my_table WHERE uid = '12345';
  • 18. © Josh Elser 2015, Hortonworks Apache Hive Using SQL to query Accumulo is a refreshing change, but the write-path with Hive leaves a bit to be desired. Will often require data ingest through another tool. M: Apache Hive is a very well-defined community with its own processes. S: HQL sometimes feels a bit clunky due to limitations of the StorageHandler interface. P: Lots of effort here in Hive recently using Apache Calcite and Apache Tez to optimize query execution and reduce MapReduce overhead. Translating Accumulo Key-Values to Hive’s types can be expensive as well. EoU: HQL as it stands is close enough to make those familiar with SQL feel at home. Some oddities to work around, but they are typically easy to deal with. E: Like Pig, Hive also suffers from the lack of an ability to represent features like cell-level visibility. Some options, like table configuration, are exposed through Hive, but most cases will require custom manipulation and configuration of Accumulo tables before using Hive.
  • 19. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” Apache Thrift is a software framework which combines a software stack with a code generation engine to build cross-language services.[1] Thrift is the software that Accumulo builds its client-server RPC service on. Thrift provides desirable features such as optional message fields and well-performing abstractions over low-level details such as threading and connection management. Clients and servers don’t need to be implemented in the same language as each other. 1. http://thrift.apache.org
  • 20. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” Clients could directly implement the necessary code to speak directly to the Accumulo Master and TabletServers, but that is an extremely large undertaking. Accumulo provides an optional “Proxy” process which provides a Java API-like interface over Thrift instead of the low-level RPC Thrift API. Accumulo bundles Python and Ruby client bindings by default. Generating other languages is simple when Thrift is already installed.
  • 21. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” Ruby: proxy.createTable(login, table, true, Accumulo::TimeType::MILLIS) unless proxy.tableExists(login, table) update1 = Accumulo::ColumnUpdate.new({'colFamily' => "cf1", 'colQualifier' => "cq1", 'value' => "a"}) update2 = Accumulo::ColumnUpdate.new({'colFamily' => "cf2", 'colQualifier' => "cq2", 'value' => "b"}) proxy.updateAndFlush(login, table, {'row1' => [update1, update2]}) cookie = proxy.createScanner(login, table, nil) result = proxy.nextK(cookie, 10) result.results.each { |keyvalue| puts "Key: #{keyvalue.key.inspect} Value: #{keyvalue.value}" } Python: if not client.tableExists(login, table): client.createTable(login, table, True, TimeType.MILLIS) row1 = {'a': [ColumnUpdate('a', 'a', value='value1'), ColumnUpdate('b', 'b', value='value2')]} client.updateAndFlush(login, table, row1) cookie = client.createScanner(login, table, None) for entry in client.nextK(cookie, 10).results: print entry
  • 22. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” The first noticeable difference is that a Python or Ruby client will perform much worse than a native Java client. Some of the performance loss is likely in using a dynamic language. Your experience in the language is relevant too. Most of the performance loss is due to passing all requests through the Proxy before they reach the TabletServers. Proxy servers are not highly available and would require manual load balancing. Single-client environments work well, but many active clients will overload a Proxy.
  • 23. © Josh Elser 2015, Hortonworks Accumulo “Thrift Proxy” The novelty of using languages like Python and Ruby to interact with Accumulo is enjoyable. The Proxy’s architecture will not scale well past a few clients. M: The Proxy isn’t widely (publicly) used but is generally maintained by devs. S: Because the Proxy server API isn’t in the Accumulo Public API, no guarantees are made on its methods. P: High availability and load balancing are left to users to solve. Will take significant engineering effort to smartly scale to supporting many clients. EoU: Thrift tends to generate decent code to work with for each supported language, which makes writing clients feel relatively natural. E: The generated client code per language could easily be extended to act more like an ORM. The full spectrum of Accumulo’s Java API is exposed via the Proxy, which doesn’t impose limitations on use.
  • 24. © Josh Elser 2015, Hortonworks Cascading and Spark Apache Spark has been causing big waves in the Hadoop community for the past year, touted as everything from a complete replacement for MapReduce to a complementary technology. Cascading (not at the ASF, but ASLv2-licensed) is an abstraction layer on top of various Hadoop components. It’s been around for quite some time now and is well-received. Both suffer from a lack of well-defined upstream Accumulo adoption within their respective communities. Snippets can be found online, but they’re typically end-user developed additions. Lots of opportunities for users to step up and improve each!
  • 25. © Josh Elser 2015, Hortonworks Clojure and Scala Clojure and Scala are both examples of languages which run on the JVM that are not Java. These languages should both natively support the Accumulo Java API, although it’s somewhat uncharted territory that may have subtle bugs (ACCUMULO-3718) GitHub has a smattering of example code, but definitive resources are lacking for both Clojure and Scala. Lots of opportunity for users to step up and improve support for these languages!
  • 26. © Josh Elser 2015, Hortonworks Concrete Comparison Let’s do a comparison on the effort needed to analyze some real data. Stanford hosts a collection of Amazon reviews (~35M records, ~14G gzip) that are available for use.[1] Reviews retain their category from Amazon (e.g. Books, Music, Instant Video) as well as some metadata such as the user who made the review, the score and the review text. Review scores are integer values between 1 and 5 inclusive. The steps taken were as follows: 1. Convert the raw files into CSV (custom Java code) 2. Insert the data into an Accumulo table (custom Java code) 3. Answer a query using the Accumulo Java API, Pig and Hive. The question is relatively simple and (hopefully) representative of a practical problem to solve: compute the average review on books by each identified user. If I made two book reviews with scores 1 and 5, the query would return a value of 3 for me as (1 + 5) / 2 = 3. 1. http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
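Stripped of any Accumulo specifics, the query in step 3 is just a filter, a group-by, and an average. A minimal in-memory sketch of that logic in Python, with the (user, category, score) record shape assumed purely for illustration:

```python
from collections import defaultdict

def average_book_reviews(reviews):
    """Filter to Books-category reviews, group by user, and average
    the scores. `reviews` is an iterable of (user, category, score)
    tuples -- a hypothetical record shape, for illustration only."""
    totals = defaultdict(lambda: [0, 0])  # user -> [score sum, review count]
    for user, category, score in reviews:
        if category == 'Books':
            totals[user][0] += score
            totals[user][1] += 1
    return {user: s / c for user, (s, c) in totals.items()}
```

This is the same shape of computation each of the three tools expresses: the Java API spells it out imperatively, while Pig and Hive get it down to a handful of declarative lines.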
  • 27. © Josh Elser 2015, Hortonworks Concrete Comparison To answer the question, we need to scan Accumulo, apply two filters, group reviews for the same user together and compute an average. I wrote a simple parser and ingester in ~750 lines of Java (leveraging some libraries). Accumulo Java API: A single-threaded client which performs all of this in memory can be achieved in 162 lines of code. Doesn’t use any custom iterators. Not a MapReduce job so the grouping phase must fit in memory. More work is needed to actually scale this solution. Pig: 1 line of Pig Latin to define the relation (table), 4 lines which perform the computations and 1 line to output the data to the console. Hive: 1 line to register our Accumulo table as a Hive table, and 1 HQL statement. Both Pig and Hive also have the ability to run as a MapReduce job which means that they can handle much larger datasets automatically.
  • 28. © Josh Elser 2015, Hortonworks Takeaways Take stock of your application needs and run your own experiments! Every approach has its pros and cons, with the Accumulo Java API really only suffering from the verbosity and boilerplate of Java applications themselves. Because each application is different, it’s important to take stock of which problems need to be solved, which can be “hacked”, and which can be completely ignored. Whatever you do choose, make an effort to contribute back to the community in some way!
  • 29. © Josh Elser 2015, Hortonworks Credit where credit is due Amazon Reviews: http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013. Other code used for the experiments: ● Parser, ingester, and query code: https://github.com/joshelser/as2015 ● Library to help ingest the data: https://github.com/joshelser/cereal Names (Apache, Apache $Project, and $Project) and logos are trademarks of the ASF and the respective Apache $Projects: Accumulo, Hive, Pig, Spark, and Thrift. The Cascading logo used was from http://www.cascading.org/ The Clojure logo used was from http://clojure.org/ The Scala logo used was copied from http://www.scala-lang.org/
  • 30. © Josh Elser 2015, Hortonworks Thanks! @josh_elser joshelser@gmail.com