Introducing Accumulo Collections:
A Practical Accumulo Interface
By Jonathan Wolff
jwolff@isentropy.com
Founder, Isentropy LLC
https://isentropy.com
Code and Documentation on Github
https://github.com/isentropy/accumulo-collections/wiki
Accumulo Needs A Practical API
● Accumulo is great under the hood, but needs a practical
interface for real-world NoSQL applications.
● Could companies use Accumulo in place of MySQL?
● Accumulo needs a layer to:
1) Handle java Object serialization locally and on tablet servers
2) Handle foreign keys/joins.
3) Abstract iterators, so that it's easy to do server-side
computations.
4) Provide a useful library of filters, transformations, aggregates.
What is Accumulo Collections?
● Accumulo Collections is a new, alternative NoSQL framework that
uses Accumulo as a backend. It abstracts powerful Accumulo
functionality in a concise java API.
● Since Accumulo is already a sorted map, java SortedMap is a
natural choice for an interface. It's already familiar to java
developers. Devs who know nothing about Accumulo can use it to
build giant, responsive NoSQL applications.
● But Accumulo Collections is more than a SortedMap
implementation...
● Many features are implemented on the tablet servers by iterators,
and wrapped in java methods. You don't need to understand
Accumulo iterators to use them.
AccumuloSortedMap wraps an
Accumulo table
● AccumuloSortedMap is a java SortedMap implementation that is backed by
an Accumulo table. It handles object serialization and foreign keys, and
abstracts powerful iterator functionality.
● Method calls derive new maps that contain transformations and aggregates.
Derived maps modify the underlying Scanner. This abstracts the concept of
iterators. Derived map methods run on-the-fly and can be chained:
// similar to SQL: WHERE timestamp BETWEEN t0 AND t1 AND rand() > .5
AccumuloSortedMap derivedMap = map.timeFilter(t0,t1).sample(0.5);
// statistical aggregate (mean, sd, n, etc) of values from key range [100,200)
StatisticalSummary stats = map.subMap(100, 200).valueStats();
Each of the above methods stacks an iterator on the underlying map. The
iterators make use of SerDes to operate directly on java Objects.
Just like a standard java
SortedMap, but…
● AccumuloSortedMap returns a copy of the map value.
You must put() to save modifications.
● To use sorted map features, the SerDe used must
serialize bytes in the same sort order as java Objects.
The default FixedPointSerde is suitable for most
common key types (strings, primitives, byte[], etc).
More about SerDes later…
● Supports sizes greater than Integer.MAX_VALUE. See
sizeAsLong().
● Can be set to read-only. Derived map methods, which
stack scan iterators, always return read-only maps.
Use Accumulo as a SortedMap
AccumuloSortedMapFactory factory = new AccumuloSortedMapFactory(conn,"factory_name");
AccumuloSortedMap<Long,String> map = factory.makeMap("mapname");
for(long i=0; i<1000; i++){
map.put(i, "value"+i);
}
map.get(123); // equals "value123"
map.keySet().iterator().next(); // equals 0
AccumuloSortedMap submap = map.subMap(100, 150);
submap.size(); // equals 50
submap.firstKey(); // equals 100
submap.keyStats().getSum(); // equals 6225.0
for(Entry<Long,String> e : submap.entrySet()){
// iterate
}
// these commands throw Exceptions. Both maps are read-only.
map.setReadOnly(true).put(1000,"nogood");
submap.put(1000,"nogood");
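Because AccumuloSortedMap implements java.util.SortedMap, the values quoted in the comments above can be cross-checked against an in-memory TreeMap. This is a minimal sketch using only the JDK, with no Accumulo involved; the class name SortedMapReference and the keySum helper (standing in for keyStats().getSum()) are illustrative, not part of Accumulo Collections.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class SortedMapReference {
    /** Builds the same 1000 entries as the slide, in an in-memory TreeMap. */
    static SortedMap<Long, String> build() {
        SortedMap<Long, String> map = new TreeMap<>();
        for (long i = 0; i < 1000; i++) {
            map.put(i, "value" + i);
        }
        return map;
    }

    /** Sums the keys of a map; a local stand-in for keyStats().getSum(). */
    static double keySum(SortedMap<Long, String> m) {
        double sum = 0;
        for (long k : m.keySet()) {
            sum += k;
        }
        return sum;
    }

    public static void main(String[] args) {
        SortedMap<Long, String> map = build();
        // subMap is half-open: keys 100 through 149
        SortedMap<Long, String> submap = map.subMap(100L, 150L);
        System.out.println(map.get(123L));      // value123
        System.out.println(submap.size());      // 50
        System.out.println(submap.firstKey());  // 100
        System.out.println(keySum(submap));     // 6225.0
    }
}
```

Note that a plain TreeMap<Long,String> needs Long arguments (123L); an int argument would not match any key.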
Timestamp Features
AccumuloSortedMap makes use of Accumulo's timestamp features
and AgeOffFilter. Each map entry has an insert timestamp:
long insertTimestamp = map.getTimestamp(key);
Can filter map by timestamp. Implemented on tablet servers.
AccumuloSortedMap timeFiltered = map.timeFilter(fromTs, toTs);
Can set an entry TTL in ms. Implemented on tablet servers. Timed
out entries are wiped during compaction:
map.setTimeOutMs(5000);
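The filtering rules described above can be modelled locally. This sketch is not the library's implementation: the Stamped type, the method names, the inclusive bounds on the time filter, and the "age <= TTL is kept" rule are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class TimestampSketch {
    /** A value paired with its insert timestamp in ms (hypothetical helper type). */
    static final class Stamped {
        final String value;
        final long timestampMs;

        Stamped(String value, long timestampMs) {
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Keeps entries with fromTs <= timestamp <= toTs (inclusive bounds are an assumption). */
    static SortedMap<String, Stamped> timeFilter(SortedMap<String, Stamped> in, long fromTs, long toTs) {
        SortedMap<String, Stamped> out = new TreeMap<>();
        for (Map.Entry<String, Stamped> e : in.entrySet()) {
            long ts = e.getValue().timestampMs;
            if (ts >= fromTs && ts <= toTs) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /** Drops entries whose age at nowMs exceeds ttlMs, as an age-off pass would. */
    static SortedMap<String, Stamped> ageOff(SortedMap<String, Stamped> in, long nowMs, long ttlMs) {
        SortedMap<String, Stamped> out = new TreeMap<>();
        for (Map.Entry<String, Stamped> e : in.entrySet()) {
            if (nowMs - e.getValue().timestampMs <= ttlMs) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<String, Stamped> map = new TreeMap<>();
        map.put("a", new Stamped("1", 1000L));
        map.put("b", new Stamped("2", 2000L));
        map.put("c", new Stamped("3", 3000L));
        System.out.println(timeFilter(map, 1500L, 3000L).keySet()); // [b, c]
        System.out.println(ageOff(map, 7500L, 5000L).keySet());     // [c]
    }
}
```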
Filter Entries by Regex
A bundled iterator filters entries on tablet servers by
comparing key.toString() and value.toString() to regexes. To
keep only the keys that match "a(b|c)":
map.put("ac","1");
map.put("ax","2");
map.put("ab","3");
// has only 1st and 3rd entries:
AccumuloSortedMap filtered = map.regexKeyFilter("a(b|c)");
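The same filter can be reproduced locally with java.util.regex. This sketch assumes the regex must match the whole key (Matcher.matches()); whether the bundled iterator uses full or partial matching is not stated here. The class name is illustrative.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class RegexKeyFilterSketch {
    /** Returns a new map holding only entries whose key fully matches the regex. */
    static SortedMap<String, String> regexKeyFilter(SortedMap<String, String> in, String regex) {
        Pattern pattern = Pattern.compile(regex);
        SortedMap<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : in.entrySet()) {
            if (pattern.matcher(e.getKey()).matches()) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<String, String> map = new TreeMap<>();
        map.put("ac", "1");
        map.put("ax", "2");
        map.put("ab", "3");
        // "ax" is rejected; the survivors come back in sorted key order
        System.out.println(regexKeyFilter(map, "a(b|c)")); // {ab=3, ac=1}
    }
}
```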
Sampling and Partitioning Features
● AccumuloSortedMap supports sampling and partitioning on the tablet
servers using the supplied SamplingFilter (Accumulo iterator).
● You can derive a map that is a random sample:
AccumuloSortedMap sampleSubmap = map.sample(0.5);
● Or you can define a Sampler which will “freeze” a fixed subsample:
Sampler s = new Sampler("my_sample_seed",0.0,0.1,fromTs, toTs);
AccumuloSortedMap frozenSample = map.sample(s);
● When you supply a sample_seed, you define an ordering of the
keys by hash(sample_seed + key bytes). The same hash range
within that ordering will produce the same sample. The fractions
indicate the hash range.
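The idea of a seeded, repeatable sample can be sketched as follows. The hash function (SHA-256), the way the hash is mapped to a fraction in [0, 1), and all names here are assumptions for illustration; the actual SamplingFilter may hash differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SeededSampleSketch {
    /** Maps hash(seed + key bytes) to a fraction in [0, 1). */
    static double hashFraction(String seed, byte[] keyBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(seed.getBytes(StandardCharsets.UTF_8));
            md.update(keyBytes);
            long h = ByteBuffer.wrap(md.digest()).getLong(); // first 8 bytes of the digest
            return (h >>> 11) * 0x1.0p-53;                   // top 53 bits scaled into [0, 1)
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every JVM
        }
    }

    /** True if the key's hash fraction falls in the half-open range [from, to). */
    static boolean inSample(String seed, byte[] keyBytes, double from, double to) {
        double f = hashFraction(seed, keyBytes);
        return f >= from && f < to;
    }

    public static void main(String[] args) {
        int kept = 0;
        for (long i = 0; i < 10000; i++) {
            byte[] key = Long.toString(i).getBytes(StandardCharsets.UTF_8);
            if (inSample("my_sample_seed", key, 0.0, 0.1)) {
                kept++;
            }
        }
        // Roughly a tenth of the keys, and the same keys on every run
        System.out.println(kept + " of 10000 keys sampled");
    }
}
```

Because the decision depends only on the seed and the key bytes, disjoint hash ranges under the same seed give disjoint partitions of the keys.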
Map Aggregates Computed on
Tablet Servers
● Aggregate functions are implemented using iterators
that calculate aggregate quantities over the entire
tablet server. The results are then combined locally.
● Similar to MapReduce with # mappers = # tservers
and # reducers = 1.
● Examples of built-in aggregate methods : size(),
checksum(), keyStats(), valueStats()
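The "aggregate per tablet server, then combine locally" pattern can be illustrated with the JDK's DoubleSummaryStatistics. This is only a local analogy: each array stands in for the values held by one tablet server, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;

final class PartialAggregateSketch {
    /** "Server side": summarize the values of one partition. */
    static DoubleSummaryStatistics summarize(double[] partition) {
        return Arrays.stream(partition).summaryStatistics();
    }

    /** "Client side": fold the partial summaries into one result. */
    static DoubleSummaryStatistics combine(List<DoubleSummaryStatistics> partials) {
        DoubleSummaryStatistics total = new DoubleSummaryStatistics();
        for (DoubleSummaryStatistics p : partials) {
            total.combine(p);
        }
        return total;
    }

    public static void main(String[] args) {
        List<DoubleSummaryStatistics> partials = new ArrayList<>();
        partials.add(summarize(new double[] {1, 2, 3})); // partition on "tserver 1"
        partials.add(summarize(new double[] {4, 5}));    // partition on "tserver 2"
        partials.add(summarize(new double[] {6}));       // partition on "tserver 3"
        DoubleSummaryStatistics total = combine(partials);
        System.out.println(total.getCount());   // 6
        System.out.println(total.getSum());     // 21.0
        System.out.println(total.getAverage()); // 3.5
    }
}
```

Only the small summary objects cross the network in this model, which is why the single local "reducer" is cheap.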
Efficient One-to-Many Mapping
● AccumuloSortedMap can be configured to allow multiple
values per key.
● Works by changing the VersioningIterator settings.
● SortedMap functions still work and see only the latest value.
● Extra methods give iterators over multiple values:
– Iterator<V> getAll(Object key)
– Iterator<Entry<K,V>> multiEntryIterator()
● All values for a given key will be stored on the same tablet
server. This enables server-side per-row aggregates. Like
SQL GROUP BY.
One-to-Many Example
map.setMaxValuesPerKey(-1); // unlimited
map.put(1, 2);
map.put(1, 3);
map.put(1, 4);
map.put(2, 22);
AccumuloSortedMap<Number, StatisticalSummary> row_stats = map.rowStats();
StatisticalSummary row1 = row_stats.get(1);
row1.getMean(); // = 3.0
row1.getMax(); // = 4.0
// count multiple values
map.sizeAsLong(true); // = 4
// sum all values, looking at 1 value per key: 4 + 22
map.valueStats().getSum(); // = 26.0
// sum all values, looking at multiple values per key: 2 + 3 + 4 + 22
map.valueStats(true).getSum(); // = 31.0
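The figures in this example can be checked with a small in-memory model of a multi-valued map. The model below is an assumption-laden sketch, not the library: it keeps every value per key in insertion order and treats the last one as the "latest"; MultiValueSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class MultiValueSketch {
    private final SortedMap<Integer, List<Integer>> rows = new TreeMap<>();

    /** Appends a value to the key's row instead of overwriting it. */
    void put(int key, int value) {
        rows.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Latest value only, which is what plain SortedMap methods would see. */
    Integer get(int key) {
        List<Integer> vs = rows.get(key);
        return vs == null ? null : vs.get(vs.size() - 1);
    }

    /** Per-row aggregate over all values of one key, like a GROUP BY. */
    IntSummaryStatistics rowStats(int key) {
        IntSummaryStatistics stats = new IntSummaryStatistics();
        for (int v : rows.getOrDefault(key, Collections.emptyList())) {
            stats.accept(v);
        }
        return stats;
    }

    /** Counts keys, or every stored value when countMultipleValues is true. */
    long size(boolean countMultipleValues) {
        if (!countMultipleValues) {
            return rows.size();
        }
        long n = 0;
        for (List<Integer> vs : rows.values()) {
            n += vs.size();
        }
        return n;
    }

    /** Sums the latest value per key, or all values when includeMultipleValues is true. */
    long valueSum(boolean includeMultipleValues) {
        long sum = 0;
        for (List<Integer> vs : rows.values()) {
            if (includeMultipleValues) {
                for (int v : vs) {
                    sum += v;
                }
            } else {
                sum += vs.get(vs.size() - 1);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        MultiValueSketch map = new MultiValueSketch();
        map.put(1, 2);
        map.put(1, 3);
        map.put(1, 4);
        map.put(2, 22);
        System.out.println(map.rowStats(1).getAverage()); // 3.0
        System.out.println(map.size(true));               // 4
        System.out.println(map.valueSum(false));          // 26
        System.out.println(map.valueSum(true));           // 31
    }
}
```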
Writing Custom Transformations and
Aggregates
● Accumulo Collections provides useful abstract iterators
that operate on deserialized java Objects.
– Iterators are passed the SerDe classnames so that they
can read the deserialized Objects.
● You can extend these iterators to implement your own
transformations and aggregates. The API is very simple:
abstract Object transformValue(Object k, Object v);
abstract boolean allow(Object k, Object v);
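To show how these two methods fit together, here is a local sketch that applies them to an in-memory map rather than inside an Accumulo iterator. The base class name ObjectIteratorSketch, the apply() driver, and the choice to call allow() before transformValue() are assumptions for illustration; only the two abstract signatures come from the slide.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

abstract class ObjectIteratorSketch {
    /** Return false to drop the entry. */
    abstract boolean allow(Object k, Object v);

    /** Return the value the derived map should expose for this entry. */
    abstract Object transformValue(Object k, Object v);

    /** Local driver: filters, then transforms, every entry of the input map. */
    final <K> SortedMap<K, Object> apply(SortedMap<K, ?> in) {
        SortedMap<K, Object> out = new TreeMap<>(in.comparator());
        for (Map.Entry<K, ?> e : in.entrySet()) {
            if (allow(e.getKey(), e.getValue())) {
                out.put(e.getKey(), transformValue(e.getKey(), e.getValue()));
            }
        }
        return out;
    }
}

/** Example subclass: keeps even keys and triples their values. */
final class EvenKeysTripled extends ObjectIteratorSketch {
    @Override
    boolean allow(Object k, Object v) {
        return ((Number) k).longValue() % 2 == 0;
    }

    @Override
    Object transformValue(Object k, Object v) {
        return 3 * ((Number) v).longValue();
    }

    public static void main(String[] args) {
        SortedMap<Long, Long> map = new TreeMap<>();
        for (long i = 1; i <= 4; i++) {
            map.put(i, 10 * i);
        }
        System.out.println(new EvenKeysTripled().apply(map)); // {2=60, 4=120}
    }
}
```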
Example: Custom JavaScript
Transformation
As an example of custom transformations, consider
ScriptTransformingIterator in the “experimental” package. You can pass
JavaScript code, which is interpreted on the tablet servers. The key and
value bind to JavaScript variables “k” and “v”. For example:
Allow only entries with even keys:
AccumuloSortedMap evens = map.jsFilter("k % 2 == 0");
Map of key → 3*value:
AccumuloSortedMap tripled = map.jsTransform(" 3*v ");
These examples work on keys and values that are java Numbers. Other
JavaScript functions also work on Strings, java Maps, etc.
Foreign Keys
Accumulo Collections provides a serializable ForeignKey Object which is
like a symbolic link that points to a map plus a key. There is no integrity
checking of the link:
map1.put("key1", "value1");
ForeignKey fk_to_key1 = map1.makeForeignKey("key1");
map2.put("key2", fk_to_key1);
// both equal "value1"
fk_to_key1.resolve(conn);
map2.get("key2").resolve(conn);
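The "symbolic link" behaviour, including the lack of integrity checking, can be modelled in a few lines. Everything here is hypothetical: Registry stands in for the connection, Link for ForeignKey, and plain TreeMaps for the Accumulo-backed maps.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class ForeignKeySketch {
    /** Hypothetical stand-in for the connection: looks maps up by name. */
    static final class Registry {
        private final Map<String, SortedMap<String, Object>> maps = new HashMap<>();

        SortedMap<String, Object> map(String name) {
            return maps.computeIfAbsent(name, n -> new TreeMap<>());
        }
    }

    /** A serializable link: a map name plus a key, resolved only on demand. */
    static final class Link implements Serializable {
        private static final long serialVersionUID = 1L;
        final String mapName;
        final String key;

        Link(String mapName, String key) {
            this.mapName = mapName;
            this.key = key;
        }

        /** Follows the link; returns null if the target entry no longer exists. */
        Object resolve(Registry registry) {
            return registry.map(mapName).get(key);
        }
    }

    public static void main(String[] args) {
        Registry conn = new Registry();
        conn.map("map1").put("key1", "value1");
        Link fkToKey1 = new Link("map1", "key1");
        conn.map("map2").put("key2", fkToKey1);
        System.out.println(fkToKey1.resolve(conn)); // value1

        // Nothing stops the target from being deleted; the link then dangles.
        conn.map("map1").remove("key1");
        System.out.println(fkToKey1.resolve(conn)); // null
    }
}
```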
Using AccumuloSortedMapFactory
● The map factory is the preferred way to construct
AccumuloSortedMaps. The factory is itself a map
of (map name→ map metadata) with default
settings. The factory:
– acts as a namespace, mapping map names to real
Accumulo table names.
– Configures SerDes.
– Configures other metadata like
max_values_per_key.
Factory Example
AccumuloSortedMapFactory factory;
AccumuloSortedMap map;
factory = new AccumuloSortedMapFactory(conn,"factory_table");
// 10 values per key default for all maps
factory.addDefaultProperty(MAP_PROPERTY_VALUES_PER_KEY , "10" );
// 5000ms timeout in map "mymap"
factory.addMapSpecificProperty("mymap", MAP_PROPERTY_TTL, "5000");
map = factory.makeMap("mymap");
More about SerDes
● Accumulo uses BytesWritable.compareTo() to
compare keys on the tablet servers.
– No way to set alternate comparator (?)
● Keys must be serialized in such a way that byte
sort order is same as java sort order.
● FixedPointSerde, the default SerDe, writes
Numbers in fixed point unsigned format so that
numerical comparison works. Other Objects are
java serialized.
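To make the sort-order requirement concrete, here is one common order-preserving encoding for long keys: flip the sign bit and write the result big-endian, so that unsigned byte-wise comparison agrees with numeric comparison. This is a sketch of the general technique, not necessarily the byte format FixedPointSerde uses; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

final class SortableLongSketch {
    /** Encodes v so that unsigned lexicographic byte order equals numeric order. */
    static byte[] encode(long v) {
        // Flipping the sign bit moves negatives below positives in unsigned order;
        // ByteBuffer writes big-endian by default, so the high byte comes first.
        return ByteBuffer.allocate(Long.BYTES).putLong(v ^ Long.MIN_VALUE).array();
    }

    /** Reverses encode(). */
    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the kind a byte-sorted store applies to keys. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        // A plain two's-complement encoding would sort -1 (all 0xff bytes) after 1.
        byte[] naiveMinusOne = ByteBuffer.allocate(Long.BYTES).putLong(-1L).array();
        byte[] naiveOne = ByteBuffer.allocate(Long.BYTES).putLong(1L).array();
        System.out.println(compareBytes(naiveMinusOne, naiveOne) > 0);    // true (wrong order)
        System.out.println(compareBytes(encode(-1L), encode(1L)) < 0);    // true (correct order)
    }
}
```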
Bulk Import, Saving Derived Maps
● The putAll and importAll methods in AccumuloSortedMap batch
writes to Accumulo, unlike put(). You can save a derived map using
putAll:
map.putAll(someOtherMap);
● importAll() is like putAll, but takes an Iterator as an argument. This
can be used to import entries from other sources, like input streams
and files.
map.importAll(new TsvInputStreamIterator("importfile.tsv"));
● Aside from batching, putAll() and importAll() do not do anything
special on the tablet servers. The import data all passes through the
local machine to Accumulo. The optional KeyValueTransformer runs
locally.
Benchmarks
● I benchmarked Accumulo Collections against raw
Accumulo read/writes on a toy Accumulo cluster
running in Docker. All the moving parts of a real
cluster, but running on one machine.
● All tests so far indicate that Accumulo Collections
adds very little overhead (~10%) to normal
Accumulo operation.
● I would appreciate it if someone sends me
benchmarks from a proper cluster!
Benchmark Data
[Figure: bar chart "Raw Accumulo vs Accumulo Collections", median time in ms over 10000 operations, for read, batched write, and unbatched write; series: raw, Acc Collections.]
Performance Tips
● Batched writes are much faster. Use putAll() and
importAll() in place of put() when possible.
– Write your changes locally to a memory-based
Map, then store in bulk with putAll().
● Iterating over a range is much faster than lots of
individual get() calls.
– If you need to do lots of get() calls over a small
submap, you can cache a map locally in memory
with the localCopy() method.
Contact Info
● I'm available for hire. You can email me at
jwolff@isentropy.com. My consulting company,
Isentropy, is online at https://isentropy.com .
● Accumulo Collections is available on Github at
https://github.com/isentropy/accumulo-collections
● Constructive questions and comments welcome.

Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo Interface

  • 1. Introducing Accumulo Collections: A Practical Accumulo Interface By Jonathan Wolff jwolff@isentropy.com Founder, Isentropy LLC https://isentropy.com Code and Documentation on Github https://github.com/isentropy/accumulo-collections/wiki
  • 2. Accumulo Needs A Practical API ● Accumulo is great under the hood, but needs a practical interface for real-world NoSQL applications. ● Could companies use Accumulo in place of MySQL?? ● Accumulo needs a layer to: 1) Handle java Object serialization locally and on tablet servers 2) Handle foreign keys/joins. 3) Abstract iterators, so that it's easy to do server-side computations. 4) Provide a useful library of filters, transformations, aggregates.
  • 3. What is Accumulo Collections? ● Accumulo Collections is a new, alternative NoSQL framework that uses Accumulo as a backend. It abstracts powerful Accumulo functionality in a concise java API. ● Since Accumulo is already a sorted map, java SortedMap is a natural choice for an interface. It's already familiar to java developers. Devs who know nothing about Accumulo can use it to build giant, responsive NoSQL applications. ● But Accumulo Collections is more than a SortedMap implementation... ● Many features are implemented on the tablet servers by iterators, and wrapped in java methods. You don't need to understand Accumulo iterators to use them.
  • 4. AccumuloSortedMap wraps an Accumulo table
    ● AccumuloSortedMap is a java SortedMap implementation that is backed by an Accumulo table. It handles object serialization and foreign keys, and abstracts powerful iterator functionality.
    ● Method calls derive new maps that contain transformations and aggregates. Derived maps modify the underlying Scanner. This abstracts the concept of iterators. Derived map methods run on-the-fly and can be chained:
        // similar to SQL: WHERE timestamp BETWEEN t0 AND t1 AND rand() > .5
        AccumuloSortedMap derivedMap = map.timeFilter(t0, t1).sample(0.5);
        // statistical aggregate (mean, sd, n, etc.) of values from key range [100,200)
        StatisticalSummary stats = map.subMap(100, 200).valueStats();
    Each of the above methods stacks an iterator on the underlying map. The iterators make use of SerDes to operate directly on java Objects.
  • 5. Just like a standard java SortedMap, but… ● AccumuloSortedMap returns a copy of the map value. You must put() to save modifications. ● To use sorted map features, the SerDe must serialize to bytes that sort in the same order as the java Objects. The default FixedPointSerde is suitable for most common key types (strings, primitives, byte[], etc). More about SerDes later… ● Supports sizes greater than Integer.MAX_VALUE. See sizeAsLong(). ● Can be set to read-only. Derived map methods, which stack scan iterators, always return read-only maps.
  • 6. Use Accumulo as a SortedMap
        AccumuloSortedMapFactory factory = new AccumuloSortedMapFactory(conn, "factory_name");
        AccumuloSortedMap<Long,String> map = factory.makeMap("mapname");
        for (long i = 0; i < 1000; i++) {
            map.put(i, "value" + i);
        }
        map.get(123L); // equals "value123"
        map.keySet().iterator().next(); // equals 0
        AccumuloSortedMap<Long,String> submap = map.subMap(100L, 150L);
        submap.size(); // equals 50
        submap.firstKey(); // equals 100
        submap.keyStats().getSum(); // equals 6225.0
        for (Entry<Long,String> e : submap.entrySet()) {
            // iterate
        }
        // these commands throw exceptions. Both maps are read-only.
        map.setReadOnly(true).put(1000L, "nogood");
        submap.put(1000L, "nogood");
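Because AccumuloSortedMap implements java.util.SortedMap, the values quoted on this slide can be sanity-checked against an ordinary in-memory TreeMap. The following is a minimal sketch under that assumption; the class SortedMapSemantics and its keySum helper are hypothetical stand-ins, not part of Accumulo Collections, and keySum only imitates what keyStats().getSum() is described as returning.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in; uses only java.util, no Accumulo classes.
final class SortedMapSemantics {

    /** Builds the same 1000-entry map as the slide, but in memory. */
    static SortedMap<Long, String> build() {
        SortedMap<Long, String> map = new TreeMap<>();
        for (long i = 0; i < 1000; i++) {
            map.put(i, "value" + i);
        }
        return map;
    }

    /** Sums the keys in [from, to), an in-memory analogue of keyStats().getSum(). */
    static double keySum(SortedMap<Long, String> map, long from, long to) {
        double sum = 0;
        for (long k : map.subMap(from, to).keySet()) {
            sum += k;
        }
        return sum;
    }

    public static void main(String[] args) {
        SortedMap<Long, String> map = build();
        System.out.println(map.get(123L));                 // value123
        System.out.println(map.subMap(100L, 150L).size()); // 50
        System.out.println(keySum(map, 100L, 150L));       // 6225.0
    }
}
```

The half-open range [100, 150) holds 50 keys whose sum is 50 × (100 + 149) / 2 = 6225, which matches the figures on the slide.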
  • 7. Timestamp Features
    AccumuloSortedMap makes use of Accumulo's timestamp features and AgeOffFilter.
    Each map entry has an insert timestamp:
        long insertTimestamp = map.getTimestamp(key);
    Can filter the map by timestamp. Implemented on tablet servers:
        AccumuloSortedMap timeFiltered = map.timeFilter(fromTs, toTs);
    Can set an entry TTL in ms. Implemented on tablet servers. Timed-out entries are wiped during compaction:
        map.setTimeOutMs(5000);
  • 8. Filter Entries by Regex
    A bundled iterator filters entries on tablet servers by comparing key.toString() and value.toString() to regexes. To keep only the entries whose keys match "a(b|c)":
        map.put("ac", "1");
        map.put("ax", "2");
        map.put("ab", "3");
        // has only 1st and 3rd entries:
        AccumuloSortedMap filtered = map.regexKeyFilter("a(b|c)");
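The filtering rule on this slide can be imitated locally with java.util.regex. This is only a sketch: the class RegexKeyFilterSketch is hypothetical, it runs in memory rather than on tablet servers, and it assumes the whole key string must match the pattern (the slide does not say whether a partial match would also pass).

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical local imitation of a key regex filter; not the library's iterator.
final class RegexKeyFilterSketch {

    /** Returns a new map holding only entries whose key.toString() fully matches the regex. */
    static <K extends Comparable<K>, V> SortedMap<K, V> regexKeyFilter(
            SortedMap<K, V> map, String regex) {
        Pattern pattern = Pattern.compile(regex);
        SortedMap<K, V> out = new TreeMap<>();
        map.forEach((k, v) -> {
            if (pattern.matcher(k.toString()).matches()) { // assumption: full match
                out.put(k, v);
            }
        });
        return out;
    }
}
```

With the three entries from the slide, regexKeyFilter(map, "a(b|c)") keeps "ab" and "ac" and drops "ax"; the source map is left unchanged, mirroring the idea of a derived map.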
  • 9. Sampling and Partitioning Features
    ● AccumuloSortedMap supports sampling and partitioning on the tablet servers using the supplied SamplingFilter (an Accumulo iterator).
    ● You can derive a map that is a random sample:
        AccumuloSortedMap sampleSubmap = map.sample(0.5);
    ● Or you can define a Sampler which will "freeze" a fixed subsample:
        Sampler s = new Sampler("my_sample_seed", 0.0, 0.1, fromTs, toTs);
        AccumuloSortedMap frozenSample = map.sample(s);
    ● When you supply a sample_seed, you define an ordering of the keys by hash(sample_seed + key bytes). The same hash range within that ordering will produce the same sample. The fractions indicate the hash range.
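The idea of a "frozen" sample defined by hash(sample_seed + key bytes) can be sketched in a few lines. The slides do not name the hash function, so SHA-256 here is purely an assumption, and the class SeededSampleSketch and its methods are hypothetical. The point is only that a fixed seed maps each key to a stable fraction in [0, 1), so a fraction range such as [0.0, 0.1) always selects the same keys, and disjoint ranges partition the key set.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of seed-based sampling; SHA-256 is an assumed hash, not the library's.
final class SeededSampleSketch {

    /** Maps (seed, key bytes) to a deterministic fraction in [0, 1). */
    static double hashFraction(String seed, byte[] keyBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(seed.getBytes(StandardCharsets.UTF_8));
            byte[] digest = md.digest(keyBytes);
            long bits = 0;
            for (int i = 0; i < 8; i++) {
                bits = (bits << 8) | (digest[i] & 0xffL);
            }
            // Keep the top 53 bits so the result is an exact double below 1.0.
            return (bits >>> 11) * 0x1.0p-53;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }

    /** True if the key's hash fraction falls in the half-open range [from, to). */
    static boolean inSample(String seed, byte[] keyBytes, double from, double to) {
        double f = hashFraction(seed, keyBytes);
        return f >= from && f < to;
    }
}
```

Under this sketch, inSample("my_sample_seed", key, 0.0, 0.1) and inSample("my_sample_seed", key, 0.1, 1.0) are never both true for the same key, which is what makes the ranges usable as train/test style partitions.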
  • 10. Map Aggregates Computed on Tablet Servers ● Aggregate functions are implemented using iterators that calculate aggregate quantities over the entire tablet server. The results are then combined locally. ● Similar to MapReduce with # mappers = # tservers and # reducers = 1. ● Examples of built-in aggregate methods: size(), checksum(), keyStats(), valueStats()
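The "partial aggregate per tablet server, then combine locally" pattern works for any statistic that can be merged from partial results. A minimal sketch, assuming count, sum, min and max are the quantities carried back from each server; the class PartialStats is hypothetical and is not the library's wire format or its StatisticalSummary implementation.

```java
import java.util.List;

// Hypothetical mergeable summary; each instance stands for one tablet server's partial result.
final class PartialStats {
    final long n;
    final double sum;
    final double min;
    final double max;

    PartialStats(long n, double sum, double min, double max) {
        this.n = n;
        this.sum = sum;
        this.min = min;
        this.max = max;
    }

    /** The "map" step: summarise the values seen by one server. */
    static PartialStats of(double... values) {
        long n = 0;
        double sum = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            n++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new PartialStats(n, sum, min, max);
    }

    /** Merges two partial results without revisiting the underlying values. */
    PartialStats merge(PartialStats o) {
        return new PartialStats(n + o.n, sum + o.sum, Math.min(min, o.min), Math.max(max, o.max));
    }

    double mean() {
        return n == 0 ? Double.NaN : sum / n;
    }

    /** The single local "reduce" step over all servers' partial results. */
    static PartialStats combine(List<PartialStats> partials) {
        PartialStats total = of();
        for (PartialStats p : partials) {
            total = total.merge(p);
        }
        return total;
    }
}
```

For example, combining PartialStats.of(2, 3) from one server with PartialStats.of(4, 22) from another gives n = 4, sum = 31.0 and mean = 7.75, the same as aggregating all four values in one place.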
  • 11. Efficient One-to-Many Mapping ● AccumuloSortedMap can be configured to allow multiple values per key. ● Works by changing the VersioningIterator settings. ● SortedMap functions still work and see only the latest value. ● Extra methods give iterators over multiple values: – Iterator<V> getAll(Object key) – Iterator<Entry<K,V>> multiEntryIterator() ● All values for a given key will be stored on the same tablet server. This enables server-side per-row aggregates. Like SQL GROUP BY.
  • 12. One-to-Many Example
        map.setMaxValuesPerKey(-1); // unlimited
        map.put(1, 2);
        map.put(1, 3);
        map.put(1, 4);
        map.put(2, 22);
        AccumuloSortedMap<Number, StatisticalSummary> row_stats = map.rowStats();
        StatisticalSummary row1 = row_stats.get(1);
        row1.getMean(); // = 3.0
        row1.getMax(); // = 4.0
        // count multiple values
        map.sizeAsLong(true); // = 4
        // sum all values, looking at 1 value per key: 4 + 22
        map.valueStats().getSum(); // = 26.0
        // sum all values, looking at multiple values per key: 2 + 3 + 4 + 22
        map.valueStats(true).getSum(); // = 31.0
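The numbers in this example follow from two views of the same data: the plain SortedMap view sees only the latest value per key, while the multi-value view sees every stored value. A small in-memory model makes the arithmetic explicit. The class MultiValueSketch and its method names are hypothetical; it does not use Accumulo's VersioningIterator and only mirrors the semantics described on the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of a map that keeps every value written to a key.
final class MultiValueSketch {
    private final TreeMap<Integer, List<Double>> rows = new TreeMap<>();

    void put(int key, double value) {
        rows.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Latest value only, as the plain SortedMap view would see it; null if absent. */
    Double get(int key) {
        List<Double> values = rows.get(key);
        return values == null ? null : values.get(values.size() - 1);
    }

    /** Number of keys, or number of stored values when countMultiple is true. */
    long size(boolean countMultiple) {
        if (!countMultiple) {
            return rows.size();
        }
        long n = 0;
        for (List<Double> values : rows.values()) {
            n += values.size();
        }
        return n;
    }

    /** Sum over the latest value per key, or over all stored values when allValues is true. */
    double valueSum(boolean allValues) {
        double sum = 0;
        for (List<Double> values : rows.values()) {
            if (allValues) {
                for (double v : values) {
                    sum += v;
                }
            } else {
                sum += values.get(values.size() - 1);
            }
        }
        return sum;
    }

    /** Mean of all values stored under one key, a per-row aggregate like GROUP BY. */
    double rowMean(int key) {
        List<Double> values = rows.get(key);
        if (values == null) {
            return Double.NaN;
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }
}
```

After the four puts from the slide, valueSum(false) is 4 + 22 = 26.0, valueSum(true) is 2 + 3 + 4 + 22 = 31.0, size(true) is 4, and rowMean(1) is 3.0.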
  • 13. Writing Custom Transformations and Aggregates
    ● Accumulo Collections provides useful abstract iterators that operate on deserialized java Objects.
      – Iterators are passed the SerDe classnames so that they can read the deserialized Objects.
    ● You can extend these iterators to implement your own transformations and aggregates. The API is very simple:
        abstract Object transformValue(Object k, Object v);
        abstract boolean allow(Object k, Object v);
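To show how the two abstract methods fit together, here is a local sketch that applies them to an in-memory map. Only the signatures transformValue(Object, Object) and allow(Object, Object) come from the slide; the base class EntryTransformSketch, its apply method, and the example subclass TripleEvenKeys are hypothetical, and the real base classes are Accumulo iterators that run on tablet servers. The sketch also assumes allow() is evaluated on the original, untransformed entry.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical local stand-in for an abstract transforming/filtering iterator.
abstract class EntryTransformSketch {
    abstract Object transformValue(Object k, Object v);

    abstract boolean allow(Object k, Object v);

    /** Applies the filter, then the transformation, to every entry of an in-memory map. */
    final SortedMap<Object, Object> apply(SortedMap<?, ?> in) {
        SortedMap<Object, Object> out = new TreeMap<>();
        for (Map.Entry<?, ?> e : in.entrySet()) {
            if (allow(e.getKey(), e.getValue())) { // assumption: filter sees the original value
                out.put(e.getKey(), transformValue(e.getKey(), e.getValue()));
            }
        }
        return out;
    }
}

// Example subclass: keep even keys and triple their values.
final class TripleEvenKeys extends EntryTransformSketch {
    @Override
    Object transformValue(Object k, Object v) {
        return ((Number) v).longValue() * 3;
    }

    @Override
    boolean allow(Object k, Object v) {
        return ((Number) k).longValue() % 2 == 0;
    }
}
```

Applied to {1→10, 2→20, 3→30, 4→40}, new TripleEvenKeys().apply(map) yields {2→60, 4→120}.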
  • 14. Example: Custom JavaScript Transformation
    As an example of custom transformations, consider ScriptTransformingIterator in the "experimental" package. You can pass JavaScript code, which is interpreted on the tablet servers. The key and value bind to the JavaScript variables "k" and "v". For example:
    Allow only entries with even keys:
        AccumuloSortedMap evens = map.jsFilter("k % 2 == 0");
    Map of key → 3*value:
        AccumuloSortedMap tripled = map.jsTransform(" 3*v ");
    These examples work on keys and values that are java Numbers. Other JavaScript functions also work on Strings, java Maps, etc.
  • 15. Foreign Keys
    Accumulo Collections provides a serializable ForeignKey Object which is like a symbolic link that points to a map plus a key. There is no integrity checking of the link:
        map1.put("key1", "value1");
        ForeignKey fk_to_key1 = map1.makeForeignKey("key1");
        map2.put("key2", fk_to_key1);
        // both equal "value1"
        fk_to_key1.resolve(conn);
        map2.get("key2").resolve(conn);
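The "symbolic link" behaviour, including the lack of integrity checking, can be illustrated with plain maps. In this sketch a registry of named maps stands in for the Accumulo connection; the class ForeignKeySketch and the registry are hypothetical and say nothing about how the library's ForeignKey is actually serialized or resolved.

```java
import java.util.Map;

// Hypothetical symbolic link: a map name plus a key, resolved lazily against a registry.
final class ForeignKeySketch {
    final String mapName;
    final Object key;

    ForeignKeySketch(String mapName, Object key) {
        this.mapName = mapName;
        this.key = key;
    }

    /**
     * Looks the key up in the named map at resolve time.
     * Returns null if the map or key no longer exists: nothing keeps the link valid.
     */
    Object resolve(Map<String, Map<Object, Object>> registry) {
        Map<Object, Object> target = registry.get(mapName);
        return target == null ? null : target.get(key);
    }
}
```

Because the link stores only a name and a key, deleting "key1" from the target map leaves a dangling link whose resolve() returns null in this sketch, which is the in-memory analogue of having no referential integrity.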
  • 16. Using AccumuloSortedMapFactory ● The map factory is the preferred way to construct AccumuloSortedMaps. The factory is itself a map of (map name→ map metadata) with default settings. The factory: – acts as a namespace, mapping map names to real Accumulo table names. – Configures SerDes. – Configures other metadata like max_values_per_key.
  • 17. Factory Example
        AccumuloSortedMapFactory factory;
        AccumuloSortedMap map;
        factory = new AccumuloSortedMapFactory(conn, "factory_table");
        // 10 values per key default for all maps
        factory.addDefaultProperty(MAP_PROPERTY_VALUES_PER_KEY, "10");
        // 5000ms timeout in map "mymap"
        factory.addMapSpecificProperty("mymap", MAP_PROPERTY_TTL, "5000");
        map = factory.makeMap("mymap");
  • 18. More about SerDes ● Accumulo uses BytesWritable.compareTo() to compare keys on the tablet servers. – No way to set alternate comparator (?) ● Keys must be serialized in such a way that the byte sort order is the same as the java sort order. ● FixedPointSerde, the default SerDe, writes Numbers in fixed point unsigned format so that numerical comparison works. Other Objects are java serialized.
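Why the byte layout matters can be seen with a single long. A plain two's-complement encoding sorts negative numbers after positive ones under unsigned byte comparison; flipping the sign bit and writing big-endian fixes that. The sketch below shows this common order-preserving trick. It is an illustration of the requirement only and is not claimed to be the actual FixedPointSerde format; the class OrderPreservingLongSketch is hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical order-preserving encoding for long keys; not the library's SerDe.
final class OrderPreservingLongSketch {

    /** Big-endian bytes with the sign bit flipped, so unsigned byte order equals numeric order. */
    static byte[] encode(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v ^ Long.MIN_VALUE).array();
    }

    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    /** Lexicographic comparison treating each byte as unsigned, as a byte-sorted store would. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

With this encoding, compareUnsigned(encode(-5), encode(3)) is negative, just like Long.compare(-5, 3), so range scans and subMap-style operations over the encoded keys see the numeric order.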
  • 19. Bulk Import, Saving Derived Maps
    ● The putAll() and importAll() methods in AccumuloSortedMap batch writes to Accumulo, unlike put(). You can save a derived map using putAll():
        map.putAll(someOtherMap);
    ● importAll() is like putAll(), but takes an Iterator as an argument. This can be used to import entries from other sources, like input streams and files:
        map.importAll(new TsvInputStreamIterator("importfile.tsv"));
    ● Aside from batching, putAll() and importAll() do not do anything special on the tablet servers. The import data all passes through the local machine to Accumulo. The optional KeyValueTransformer runs locally.
  • 20. Benchmarks ● I benchmarked Accumulo Collections against raw Accumulo reads/writes on a toy Accumulo cluster running in Docker. All the moving parts of a real cluster, but running on one machine. ● All tests so far indicate that Accumulo Collections adds very little overhead (~10%) to normal Accumulo operation. ● I would appreciate it if someone sends me benchmarks from a proper cluster!
  • 21. Benchmark Data [Bar chart: Raw Accumulo vs Accumulo Collections, median time in ms over 10000 operations, for read, batched write, and unbatched write. Series: raw, Acc Collections.]
  • 22. Performance Tips ● Batched writes are much faster. Use putAll() and importAll() in place of put() when possible. – Write your changes locally to a memory-based Map, then store in bulk with putAll(). ● Iterating over a range is much faster than lots of individual get() calls. – If you need to do lots of get() calls over a small submap, you can cache a map locally in memory with the localCopy() method.
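The first tip, buffering changes in a local map and storing them in bulk with putAll(), can be wrapped in a small helper. This is a sketch under the assumption that the target is any java.util.Map whose putAll() is cheaper than many put() calls; the class BatchingWriterSketch, its batch size parameter and its flush counter are hypothetical and not part of Accumulo Collections.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical write buffer: collects puts locally, then sends them with one putAll().
final class BatchingWriterSketch<K extends Comparable<K>, V> {
    private final Map<K, V> target;
    private final int batchSize;
    private final TreeMap<K, V> buffer = new TreeMap<>();
    private int flushes = 0;

    BatchingWriterSketch(Map<K, V> target, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.target = target;
        this.batchSize = batchSize;
    }

    /** Buffers the entry and flushes automatically once the buffer is full. */
    void put(K key, V value) {
        buffer.put(key, value);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Writes all buffered entries with a single putAll(); a no-op when the buffer is empty. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        target.putAll(buffer);
        buffer.clear();
        flushes++;
    }

    int flushCount() {
        return flushes;
    }
}
```

Writing 250 entries with a batch size of 100 triggers two automatic flushes and leaves 50 entries buffered until a final flush() is called, so callers should always flush before relying on the target map's contents.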
  • 23. Contact Info ● I'm available for hire. You can email me at jwolff@isentropy.com. My consulting company, Isentropy, is online at https://isentropy.com . ● Accumulo Collections is available on Github at https://github.com/isentropy/accumulo-collections ● Constructive questions and comments welcome.