Apache Accumulo is a sorted, distributed key-value store built on top of Apache Hadoop, ZooKeeper, and Thrift. It is based on Google's BigTable design, with improvements such as cell-based access control and server-side programming capabilities. Accumulo stores data under a key made of a row ID, column family, column qualifier, and visibility label, and its scalable architecture allows fast lookups and scans of large datasets in a distributed environment.
An Introduction to Accumulo
1. AN INTRODUCTION TO
APACHE ACCUMULO
HOW IT WORKS, WHY IT EXISTS, AND HOW IT IS USED
Donald Miner
CTO, ClearEdge IT Solutions
@donaldpminer
August 5th, 2014
2. The Apache Accumulo sorted, distributed key/value store is
a robust, scalable, high performance data storage and
retrieval system.
3. The Apache Accumulo sorted, distributed key/value store is
a robust, scalable, high performance data storage and
retrieval system.
Adelaide Bartkowski
Alyssa Files
Beatriz Palmore
Cecilia Ours
Craig Avalos
Dianna Lapointe
Erma Davis
Fermina Smead
Garrett Harsh
Gaylene Sherry
Gilberto Pardue
Hui Nodal
Janell Tomita
Jannette Betters
Jeana Delk
Madlyn Radke
Peggie Allis
Rhona Zygmont
Tran Degarmo
Wilhelmina Papp
4. The Apache Accumulo sorted, distributed key/value store is
a robust, scalable, high performance data storage and
retrieval system.
-inf to D:
Adelaide Bartkowski
Alyssa Files
Beatriz Palmore
Cecilia Ours
Craig Avalos
Dianna Lapointe
E to H:
Erma Davis
Fermina Smead
Garrett Harsh
Gaylene Sherry
Gilberto Pardue
Hui Nodal
J to +inf:
Janell Tomita
Jannette Betters
Jeana Delk
Madlyn Radke
Peggie Allis
Rhona Zygmont
Tran Degarmo
Wilhelmina Papp
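The range split above can be sketched with a sorted map from each range's end row to the server holding it. This is only an illustration of the idea, not Accumulo's own code: the class `TabletLocator`, the server names, and the split points "E" and "J" are assumptions chosen so that the names fall into the same three groups as on the slide.

```java
import java.util.Map;
import java.util.TreeMap;

class TabletLocator {
    // Maps the inclusive end row of each range to a hypothetical server name.
    private final TreeMap<String, String> endRowToServer = new TreeMap<>();
    // Server holding the last range, which extends to +inf.
    private final String lastServer;

    TabletLocator(String lastServer) {
        this.lastServer = lastServer;
    }

    void addSplit(String endRow, String server) {
        endRowToServer.put(endRow, server);
    }

    // Finds the first range whose end row is >= the requested row.
    String locate(String row) {
        Map.Entry<String, String> entry = endRowToServer.ceilingEntry(row);
        return entry != null ? entry.getValue() : lastServer;
    }

    // Assumed split points that reproduce the grouping shown on the slide.
    static TabletLocator example() {
        TabletLocator locator = new TabletLocator("tserver3");
        locator.addSplit("E", "tserver1");
        locator.addSplit("J", "tserver2");
        return locator;
    }
}
```

Because the split points are sorted, finding the right range is a single ordered-map lookup, which is why a client can route a request for any row without consulting every server.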
5. The Apache Accumulo sorted, distributed key/value store is
a robust, scalable, high performance data storage and
retrieval system.
Accumulo Master
TabletServer TabletServer TabletServer
ZooKeeper
6. The Apache Accumulo sorted, distributed key/value store is
a robust, scalable, high performance data storage and
retrieval system.
KEY VALUE
Adelaide Bartkowski 91294124
Alyssa Files 491294
Beatriz Palmore 4124124124
Cecilia Ours 419120
Craig Avalos 940124
Dianna Lapointe 4921
Erma Davis 050194
Fermina Smead 10024599949
Garrett Harsh 140095931
Gaylene Sherry 914815
Gilberto Pardue 412414124124
Hui Nodal 962195192
Janell Tomita 12121
Jannette Betters 9192012
Jeana Delk 9120150
Madlyn Radke 4921
Peggie Allis 944944
Rhona Zygmont 123103
Tran Degarmo 9499494
Wilhelmina Papp 11221
Lookup “Garrett Harsh”: FAST
Lookup “4921”: SLOW
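A minimal sketch of why the first lookup is fast and the second is slow, using a plain `java.util.TreeMap` as a stand-in for a sorted key/value table. The class and method names are hypothetical, and only a few rows from the slide are loaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedLookupDemo {
    static TreeMap<String, String> table() {
        TreeMap<String, String> t = new TreeMap<>();
        t.put("Dianna Lapointe", "4921");
        t.put("Garrett Harsh", "140095931");
        t.put("Madlyn Radke", "4921");
        t.put("Wilhelmina Papp", "11221");
        return t;
    }

    // Lookup by key: the sort order lets the map jump straight to the entry.
    static String lookupByKey(TreeMap<String, String> t, String key) {
        return t.get(key);
    }

    // Lookup by value: nothing is sorted by value, so every entry is visited.
    static List<String> lookupByValue(TreeMap<String, String> t, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e : t.entrySet()) {
            if (e.getValue().equals(value)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }
}
```

The value lookup touches every entry, so its cost grows with the size of the table, while the key lookup stays logarithmic.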
12. The Apache Accumulo sorted, distributed key/value store is
a robust, scalable, high performance data storage and
retrieval system.
MIT Lincoln Lab study:
100 million inserts per second using Accumulo
http://arxiv.org/ftp/arxiv/papers/1406/1406.4923.pdf
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf
Booz Allen Hamilton study:
942 tablet servers, 7.56 trillion entries, 408 TB, 26 hours
94 MB/sec, 15 TB/hr, 80 million inserts per second
11 tablet servers went down with no interruption
Showed linear scalability for write throughput
22,000 queries per second
13. Apache Accumulo is based on Google's BigTable design
and is built on top of Apache Hadoop, ZooKeeper, and
Thrift. Apache Accumulo features a few novel
improvements on the BigTable design in the form of
cell-based access control and a server-side programming
mechanism that can modify key/value pairs at various
points in the data management process. Other notable
improvements and features are outlined here.
Google published the design of BigTable in 2006. Several
other open source projects have implemented aspects of
this design, including HBase, Hypertable, and Cassandra.
Accumulo began its development in 2008 and joined the
Apache community in 2011.
17. HBase vs. Accumulo
• Slight differences in visibility labels
• Coprocessors vs. Iterators
• Accumulo has faster write throughput*
• HBase’s reads are faster*
• HBase has more ecosystem integration
• BatchScanner
• Accumulo can shift around locality groups after the fact
• Accumulo has been shown to work with no problems at 1,000
nodes (BAH paper). Facebook and others run a “cell”
design for HBase. Largest clusters in the hundreds*.
* We believe
Disclaimer: I am biased
19. Cell-based access control example:
(admin & developer) | analyst
20. Column Visibility Syntax
Label Description
A & B Both ‘A’ and ‘B’ are required
A | B Either ‘A’ or ‘B’ is required
A & (C | B) ‘A’ and ‘C’ or ‘A’ and ‘B’ is required
A | (B & C) ‘A’ or ‘B’ and ‘C’ is required
(A | B) & (C & D) ?
A & (B & (C | D)) ?
Patient has schizophrenia: insurer | (MD & psych)
Patient has stomach ulcers: insurer | doctor
Patient has cavity: insurer | dentist
Patient has consent for general anesthesia: surgeon
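How such labels are evaluated against a user's tokens can be sketched with a small recursive-descent evaluator. This is a stand-alone illustration, not Accumulo's own evaluator: the class `VisibilitySketch` and its methods are hypothetical, whitespace is stripped for readability, tokens are limited to letters and digits, and mixing & and | at the same level without parentheses is rejected.

```java
import java.util.Set;

class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr.replaceAll("\\s+", ""); // simplification: ignore spaces
        this.auths = auths;
    }

    // Returns true if a user holding the given tokens may see the labeled cell.
    static boolean canSee(String label, Set<String> auths) {
        VisibilitySketch p = new VisibilitySketch(label, auths);
        if (p.expr.isEmpty()) {
            return true; // an empty label is treated as visible to everyone
        }
        boolean result = p.parseExpr();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected character at " + p.pos);
        }
        return result;
    }

    // expr := term (('&' term)* | ('|' term)*)
    private boolean parseExpr() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | requires parentheses");
            }
            op = next;
            boolean rhs = parseTerm();
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    // term := token | '(' expr ')'
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpr();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing closing parenthesis");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a token at " + pos);
        }
        return auths.contains(expr.substring(start, pos));
    }
}
```

For example, `VisibilitySketch.canSee("A & (C | B)", Set.of("A", "B"))` evaluates the label against a user holding tokens A and B, and the two "?" rows in the table can be explored the same way with different token sets.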
23. More cool features
• Constraints: user-defined Java functions that allow or
prevent new writes based on a condition
• Large rows: no limit on data stored in a row
• Multiple masters & FATE: able to execute table operations
in a fault-tolerant manner
• MapReduce InputFormats
• Bulk import utilities: write directly to Accumulo file formats
• Batch scanner: client scans multiple ranges at once
• Batch writer: client buffers and organizes data before
writing in parallel
25. More cool features
• Thrift proxy: access Accumulo through Ruby, Python, …
• Monitor page: shows performance, status, errors, more
• Locality groups: group column families together on disk
for performance tuning (changeable later)
• On-HDFS at rest encryption (work in progress)
• Table import and export
27. Scalability & Performance
• Multiple HDFS volumes: Accumulo can use multiple
NameNodes to store its data
• Master stores metadata in an Accumulo table
• Native in-memory map: data is first written into a buffer
written in C++, outside of Java
• Relative encoding: when consecutive keys share the same field
values, the repeated fields are flagged instead of rewritten
• Scan pipelines: stages of the read path are parallelized
into separate threads
• Caching: data recently scanned is cached
29. Data Model
KEY: ROW ID, COLUMN (FAMILY, QUALIFIER, VISIBILITY), TIMESTAMP
VALUE
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public 12423523 @donaldpminer
don info height public 12314514 5’ 9”
don info SSN private 12314514 123-45-6789
erica … … … … …
30. Data Model
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public | private 12423523 @donaldpminer
don info height public | private 12314514 5’ 9”
don info SSN private 12314514 123-45-6789
erica … … … … …
Name email twitter picture height SSN
derek de…@ad….com 9efe23aa… 6’2”
don dm…@cl….com @donaldpminer 5’ 9”
erica @erica aef319eaf…
31. Data Model
Row ID: lookup key
32. Data Model
Column family: collection of data that is kept together
33. Data Model
Column qualifier: what the data is
34. Data Model
Visibility: who can see the data
35. Data Model
Timestamp: when the data was created
36. Data Model
UNIQUENESS
37. Data Model
SORTED
38. Data Model
Value: some piece of information
39. Writing data into Accumulo
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public 12423523 @donaldpminer
don info height public 12314514 5’ 9”
don info SSN private 12314514 123-45-6789
erica … … … … …
New entry:
Row ID Family Qualifier Visibility Timestamp Value
don info picture public 13119103 dd3ae1d3b951a33f…
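40. Writing data into Accumulo
Row ID Family Qualifier Visibility Timestamp Value
don info picture public 13119103 dd3ae1d3b951a33f…
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// Build the key components for the new entry.
Text rowID = new Text("don");
Text colFam = new Text("info");
Text colQual = new Text("picture");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();
Value value = new Value(MyPictureObj.getBytes());
// A Mutation groups the changes made to a single row.
Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);
// conn is an existing Connector; the BatchWriter buffers and sends mutations.
BatchWriterConfig config = new BatchWriterConfig();
BatchWriter writer = conn.createBatchWriter("usertable", config);
writer.addMutation(mutation);
writer.close();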
41. Writing data into Accumulo
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public 12423523 @donaldpminer
don info height public 12314514 5’ 9”
don info picture public 13119103 dd3ae1d3b951a33f…
don info SSN private 12314514 123-45-6789
erica … … … … …
New entry:
Row ID Family Qualifier Visibility Timestamp Value
don info picture public 13119103 dd3ae1d3b951a33f…
45. Writing data into Accumulo
[Diagram: a New Record is written to the Write Ahead Log (WAL) and the MemTable]
46. Writing data into Accumulo
[Diagram: a Minor Compaction flushes the MemTable into a sorted RFile (minc)]
47. Writing data into Accumulo
[Diagram: a second Minor Compaction produces a second RFile (minc)]
48. Writing data into Accumulo
[Diagram: a third Minor Compaction produces a third RFile (minc)]
49. Writing data into Accumulo
[Diagram: a Major Compaction merges the three RFiles (minc) into one sorted RFile (majc)]
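The write path in these slides can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the class `WritePathSketch`, the entry-count limit on the MemTable, and the use of `TreeMap` objects in place of RFiles. Clearing the log after a flush is a simplification of how a real system retires log files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class WritePathSketch {
    final List<String> writeAheadLog = new ArrayList<>();
    TreeMap<String, String> memTable = new TreeMap<>();
    // Each flushed, sorted map stands in for one RFile.
    final List<TreeMap<String, String>> rFiles = new ArrayList<>();
    private final int memTableLimit;

    WritePathSketch(int memTableLimit) {
        this.memTableLimit = memTableLimit;
    }

    // A new record is logged first, then placed in the sorted in-memory table.
    void write(String key, String value) {
        writeAheadLog.add(key + "=" + value);
        memTable.put(key, value);
        if (memTable.size() >= memTableLimit) {
            minorCompact();
        }
    }

    // Minor compaction: flush the full MemTable as a new sorted file.
    void minorCompact() {
        if (memTable.isEmpty()) {
            return;
        }
        rFiles.add(memTable);
        memTable = new TreeMap<>();
        writeAheadLog.clear(); // simplification: flushed data no longer needs the log
    }

    // Major compaction: merge all files into one; newer files win on duplicates.
    void majorCompact() {
        if (rFiles.size() < 2) {
            return;
        }
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : rFiles) {
            merged.putAll(file);
        }
        rFiles.clear();
        rFiles.add(merged);
    }
}
```

Writes stay cheap because they only append to the log and update memory, and reads stay manageable because major compaction keeps the number of sorted files small.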
50. Reading data
Row ID Family Qualifier Visibility Timestamp Value
derek … … … … …
don contact email admin | private 11905014 dminer@gopivotal.com
don contact email admin | private 12412412 dminer@clearedgeit.com
don contact email public 12412412 dm…@cl....com
don contact twitter public 12423523 @donaldpminer
don info height public 12314514 5’ 9”
don info picture public 13119103 dd3ae1d3b951a33f…
don info SSN private 12314514 123-45-6789
erica … … … … …
Range: don-don; Family: info; Visibilities: public
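51. Reading data
Range: don-don; Family: info; Visibilities: public
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;
// conn is an existing Connector; only entries visible to "public" are returned.
Authorizations auths = new Authorizations("public");
Scanner scan = conn.createScanner("usertable", auths);
scan.setRange(new Range("don", "don"));
scan.fetchColumnFamily(new Text("info"));
for (Entry<Key,Value> entry : scan) {
    String row = entry.getKey().getRow().toString();
    Value value = entry.getValue();
}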
53. Reading data
Range: don-don; Family: info; Visibilities: public, user, tech
54. Reading data: Scan
Range: don-don; Visibilities: public, user, tech
55. Reading data: Scan
Range: d-e; Visibilities: public, user, tech
56. Iterators
• Iterators run tablet server side at these times:
1. Scan Time
2. Minor Compaction
3. Major Compaction
• Multiple iterators are included with Accumulo
• Custom iterators can be created using the Iterator API
61. Combiner Iterators
Apply a function to all available versions of a particular key
Row ID Family Qualifier Visibility Timestamp Value
bob attribute score public 1005 33
bob attribute score public 1004 65
bob attribute score public 1003 71
bob attribute score public 1002 59
bob attribute score public 1001 57
bob attribute score public 1000 51
MAX 71
Scan time: server side combining
Minor & Major compaction time: consolidation
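The MAX example above can be sketched in plain Java. This is not Accumulo's Combiner API: `MaxCombinerSketch` and its nested `Version` class are hypothetical names, and the key is reduced to a single string with the timestamp left out so that all versions of a cell share it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MaxCombinerSketch {
    // One version of a cell: the key without its timestamp, plus timestamp and value.
    static class Version {
        final String key;
        final long timestamp;
        final long value;

        Version(String key, long timestamp, long value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Collapses all versions of each key into a single value using max.
    static Map<String, Long> combineMax(List<Version> versions) {
        Map<String, Long> combined = new LinkedHashMap<>();
        for (Version v : versions) {
            combined.merge(v.key, v.value, Long::max);
        }
        return combined;
    }

    // The six versions of bob's score from the slide.
    static List<Version> slideData() {
        String key = "bob attribute score public";
        long[] scores = {33, 65, 71, 59, 57, 51};
        List<Version> versions = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            versions.add(new Version(key, 1005 - i, scores[i]));
        }
        return versions;
    }
}
```

Applied at scan time, such a function only changes what the client sees; applied during compaction, the combined value replaces the individual versions on disk.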
63. Basic Structured Data
Row ID Family Qualifier Visibility Timestamp Value
bob attribute surname public Jul 2013 doe
bob attribute height public Jun 2012 5’11”
bob insurance dental private Sep 2009 MetLife
jane attribute bloodType public Jul 2011 ab-
jane attribute surname public Aug 2013 doe
jane contact cellPhone public Dec 2010 (808) 345-9876
jane insurance vision private Jan 2008 VSP
john allergy major private Feb 1988 amoxicillin
john attribute weight public Sep 2013 180
john contact homeAddr public Mar 2003 34 Baker LN
64. Indexing Everything
Event Table
Row ID, Column Fam, Column Qual, Visibility, Time, value
Index Table (index to values)
value, Column Fam, Column Qual:Row ID, Visibility, Time, -
65. Index Table
Row ID Family Qualifier Visibility Timestamp Value
(808) 345-9876 contact cellPhone:jane public Dec 2010 -
180 attribute weight:john public Sep 2013 -
34 Baker LN contact homeAddr:john public Mar 2003 -
5’11” attribute height:bob public Jun 2012 -
MetLife insurance dental:bob private Sep 2009 -
VSP insurance vision:jane private Jan 2008 -
ab- attribute bloodType:jane public Jul 2011 -
amoxicillin allergy major:john private Feb 1988 -
doe attribute surname:bob public Jul 2013 -
doe attribute surname:jane public Aug 2013 -
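A minimal sketch of how index entries like these could be derived from event entries: the value becomes the row ID of the index entry and the original row ID is appended to the qualifier. The class `IndexSketch`, the string-array representation of an entry, and the sample subset are assumptions; visibility and timestamp are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class IndexSketch {
    // Each event is {rowID, family, qualifier, value}.
    // The result maps value -> list of "family qualifier:rowID", sorted by value.
    static TreeMap<String, List<String>> buildIndex(List<String[]> events) {
        TreeMap<String, List<String>> index = new TreeMap<>();
        for (String[] e : events) {
            String row = e[0], family = e[1], qualifier = e[2], value = e[3];
            index.computeIfAbsent(value, k -> new ArrayList<>())
                 .add(family + " " + qualifier + ":" + row);
        }
        return index;
    }

    // A few rows taken from the Basic Structured Data slide.
    static List<String[]> sampleEvents() {
        List<String[]> events = new ArrayList<>();
        events.add(new String[] {"bob", "attribute", "surname", "doe"});
        events.add(new String[] {"bob", "insurance", "dental", "MetLife"});
        events.add(new String[] {"jane", "attribute", "surname", "doe"});
        events.add(new String[] {"john", "attribute", "weight", "180"});
        return events;
    }
}
```

With the value as the sorted key, a question such as "who has the surname doe?" becomes a fast key lookup instead of a full scan of the event table.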
68. Data Lake
Tables: PATIENTS, DISEASES, DOCTORS, INDEX
Index entries for “amoxicillin”:
PATIENTS
bob:allergy:amoxicillin
larry:takes:amoxicillin
DISEASES
Stomach ulcer:treatment:amoxicillin
Infection:treatment:amoxicillin
Diarrhea:side effect:amoxicillin
DOCTORS
smith:prescribed:amoxicillin
69. Graphs
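[Graph diagram: nodes a, b, c, d, e]
Adjacency matrix (columns: Start Nodes; rows: End Nodes; "." marks no edge)
  a b c d e
a - . 1 . .
b 1 - . . .
c . . - 1 .
d 1 . 1 - 1
e . . . . -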
Row ID Column Family Column Qualifier Value
a edge b 1
a edge d 1
c edge a 1
c edge d 1
d edge c 1
e edge d 1
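Storing one entry per edge, with the start node as the row ID, means the outgoing edges of a node sit next to each other in sorted order. A small sketch of that idea using a `TreeMap`; the class `GraphSketch`, the "start:end" key layout, and the prefix range scan are illustrative assumptions, not Accumulo API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class GraphSketch {
    // Key is "start:end", value is the edge weight, mirroring the edge table.
    static TreeMap<String, Integer> slideEdges() {
        TreeMap<String, Integer> edges = new TreeMap<>();
        String[][] pairs = {{"a", "b"}, {"a", "d"}, {"c", "a"}, {"c", "d"}, {"d", "c"}, {"e", "d"}};
        for (String[] p : pairs) {
            edges.put(p[0] + ":" + p[1], 1);
        }
        return edges;
    }

    // All end nodes reachable from the given start node, found with one range scan.
    static List<String> outNeighbors(TreeMap<String, Integer> edges, String start) {
        String prefix = start + ":";
        List<String> neighbors = new ArrayList<>();
        for (String key : edges.subMap(prefix, prefix + Character.MAX_VALUE).keySet()) {
            neighbors.add(key.substring(prefix.length()));
        }
        return neighbors;
    }
}
```

Finding incoming edges this way would need a full scan, which is why a second table keyed by the end node is a common companion to this layout.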
70. Term-Partitioned Index
Query: “baseball” AND “football” AND “soccer”
Tablet Server 1 (knows about the term “baseball”)
Row ID, Column Family, Value
baseball document docid_3
baseball document docid_2
bat document docid_2
RESULTS: [docid_2, docid_3]
Tablet Server 2 (knows about the term “football”)
Row ID, Column Family, Value
football document docid_1
football document docid_3
glove document docid_1
RESULTS: [docid_1, docid_3]
Tablet Server 3 (knows about the term “soccer”)
Row ID, Column Family, Value
nba document docid_1
shoes document docid_1
soccer document docid_3
RESULTS: [docid_3]
Client
Client-side Set Intersection of [docid_2, docid_3], [docid_1, docid_3], and [docid_3]
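The client-side step can be sketched as a plain set intersection over the per-term result lists. The class and method names are hypothetical; in a real client each set would come from a scan of the corresponding term's row.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TermIndexSketch {
    // Keeps only the document IDs that appear in every per-term result set.
    static Set<String> intersect(List<Set<String>> perTermResults) {
        if (perTermResults.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(perTermResults.get(0));
        for (Set<String> docs : perTermResults.subList(1, perTermResults.size())) {
            result.retainAll(docs);
        }
        return result;
    }
}
```

With a term-partitioned index each term lives on one tablet server, so the servers cannot combine results themselves and the client has to pull every matching document ID before intersecting.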
Editor's Notes
Two basic operators
AND operator represented by &
OR operator represented by |
In the examples, A, B, C, and D are security tokens
Security tokens are strings of alphanumeric characters
Tokens are user defined
Parentheses are required to use nested logic
A Minor Compaction is triggered when the Tablet’s MemTable reaches its maximum size
When the MemTable reaches its maximum size, it is flushed
A Minor Compaction iterator is applied during the stage when the MemTable is flushed and a new RFile is created
Since the iterator is applied during a Minor Compaction, the iterator does affect the persistence of the data
A Major Compaction periodically merges a set of RFiles into one
If a Major Compaction iterator is enabled, the iterator runs after the merge to filter data before writing the new RFile
Since the iterator is applied during a Major Compaction, the iterator does affect the persistence of the data