Accumulo meetup 20130109

Securely explore your data

APACHE
ACCUMULO
Adam Fuchs and John Vines
sqrrl data, inc.
January 9, 2013

APACHE ACCUMULO

Sorted, Distributed Key/Value Store
Based on Google’s Bigtable Design
Built on Top of Apache Hadoop and Apache ZooKeeper
Augments and Integrates With the Hadoop ecosystem

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential


TODAY’S TALK
Overview of the Accumulo Project
Accumulo Design
Table Design Strategy
Live Demonstration



ACCUMULO TIMELINE
2003-2004: Google publishes papers: GFS (2003), MapReduce (2004)
2006: Google publishes Bigtable paper
2008: NSA begins development of Accumulo
2011: NSA open sources Accumulo into incubation at Apache
2012: Accumulo becomes a top-level Apache project
2012: sqrrl is founded
2013: First sqrrl release planned


ACCUMULO’S STRENGTHS
Apache Accumulo excels at:
- Security
  - Cell-level security reduces the cost of application development in the presence of complex legal or policy restrictions on data use
  - Mandatory access control keeps your data safe
- Scalability
  - Proven reliability and performance at the multi-petabyte scale
  - High-performance parallel I/O library
- Adaptability
  - Flexible schema support to quickly ingest new data sources
  - Sorted key/value paradigm supports a multitude of search and analysis applications
  - Server-side programming framework (“iterator trees”) supports best-in-class aggregation, filtering, and complex query semantics


BASIC SCHEMA
Accumulo stores sorted key/value pairs (entries).

An Accumulo key is a 5-tuple, consisting of:
- Row: Controls Atomicity
- Column Family: Controls Locality
- Column Qualifier: Controls Uniqueness
- Visibility Label: Controls Access
- Timestamp: Controls Versioning

Keys are sorted:
- Hierarchically: Row first, then column family, and so on.
- Lexicographically: Compare first byte, then second, and so on.

Values are byte arrays.
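
The sort order above can be sketched in plain Python. This is a toy model, not the Accumulo API: the `Key` tuple, the `sort_key` helper, and the sample entries (including the older 20121101 version) are hypothetical. It also assumes that timestamps sort newest first, which the slide does not state.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Toy stand-in for the 5-tuple key; field names are illustrative."""
    row: bytes
    family: bytes
    qualifier: bytes
    visibility: bytes
    timestamp: int


def sort_key(key: Key):
    # Hierarchical: tuples compare field by field, row first.
    # Lexicographic: bytes objects compare byte by byte.
    # Assumption: the timestamp is negated so the newest version sorts first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


entries = [
    (Key(b"John Doe", b"Test Results", b"Cholesterol", b"JD|PCP_JD", 20120912), b"183"),
    (Key(b"Jane Doe", b"Friends", b"John Doe", b"JD", 20121130), b""),
    (Key(b"John Doe", b"Friends", b"Jane Doe", b"JD", 20121101), b""),
    (Key(b"John Doe", b"Friends", b"Jane Doe", b"JD", 20121201), b""),
]

ordered = sorted(entries, key=lambda entry: sort_key(entry[0]))
for key, value in ordered:
    print(key, value)
```

Because entries that share a row sort next to each other, one range scan can return everything stored about "John Doe".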



KEY/VALUE EXAMPLES
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
Jane Doe  Friends       John Doe       JD           20121130
Jane Doe  PhoneNumber   555-1212                    20090115
John Doe  Friends       Jane Doe       JD           20121201
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  Mental Health  PSYCH_JD     20120801   Crazy!
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…



VISIBILITY SYNTAX & SEMANTICS
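
Visibility labels such as `JD|PCP_JD` are boolean expressions over authorization tokens. The sketch below evaluates one against a user's authorizations, assuming `&` means AND, `|` means OR, parentheses group, an empty label is visible to everyone, and `&` and `|` may not be mixed without parentheses. The function name `is_visible` is hypothetical, and tokens are simplified to letters, digits, and underscores. It is a toy model, not Accumulo's own evaluator.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9_]+|[&|()]")


def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    if expression == "":
        return True  # assumption: an unlabeled entry is visible to everyone
    tokens = TOKEN.findall(expression)
    if "".join(tokens) != expression:
        raise ValueError("unsupported character in expression")
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError("expected a token or '('")
        return token in authorizations

    def parse_expression():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    result = parse_expression()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result


# John's primary care physician can read the cholesterol result...
print(is_visible("JD|PCP_JD", {"PCP_JD"}))
# ...but not the entry labeled for the psychiatrist only.
print(is_visible("PSYCH_JD", {"PCP_JD"}))
```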



TABLET ORGANIZATION
Collections of entries from tables
Tables are partitioned into Tablets
Metadata tablets hold info about other tablets, forming a 3-level hierarchy
A Tablet is a unit of work for a Tablet Server

[Diagram: tablet hierarchy]
Well-Known Location (zookeeper)
Root Tablet: -∞ to ∞
Metadata tablets:
  Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
  Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
Data tablets:
  Table: Adam’s Table
    -∞ : thing
    thing : ∞
  Table: Encyclopedia
    -∞ : Ocelot
    Ocelot : Yak
    Yak : ∞
  Table: Foo
    -∞ to ∞
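
Finding the tablet that holds a row amounts to a binary search over the table's sorted split points. The sketch below uses the Encyclopedia table's splits from the diagram. The helpers `find_tablet` and `tablet_range` are hypothetical, and the code assumes each tablet's range excludes its lower bound and includes its upper bound. It is a toy model, not how a client actually walks the metadata tablets.

```python
from bisect import bisect_left


def find_tablet(split_points, row):
    """Return the index of the tablet whose range contains the row.

    Assumption: a tablet covers (previous split, its own split], so a row
    equal to a split point belongs to the tablet that ends at that split.
    """
    return bisect_left(split_points, row)


def tablet_range(split_points, index):
    """Return (lower, upper) for a tablet; None stands for -inf or +inf."""
    lower = split_points[index - 1] if index > 0 else None
    upper = split_points[index] if index < len(split_points) else None
    return lower, upper


encyclopedia_splits = ["Ocelot", "Yak"]  # three tablets, as in the diagram

for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    index = find_tablet(encyclopedia_splits, row)
    print(row, "->", tablet_range(encyclopedia_splits, index))
```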


ACCUMULO ARCHITECTURE
[Diagram: Accumulo components and the labels on the arrows between them]
Components: Zookeeper, Master, Tablet Servers (each hosting Tablets), Applications, HDFS, Garbage Collector
Arrow labels: Delegate Authority, Configs; Read/Write; Assign/Balance; Store/Replicate; Scan; Delete


TABLET DATA FLOW
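[Diagram: data flow within a Tablet]
Writes -> In-Memory Map, and -> Write Ahead Log (For Recovery)
In-Memory Map -> Minor Compaction (Iterator Tree) -> Sorted, Indexed File
Sorted, Indexed Files -> Merging / Major Compaction (Iterator Tree) -> Sorted, Indexed File
In-Memory Map and Sorted, Indexed Files -> Scan (Iterator Tree) -> Reads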
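
The write, compaction, and scan paths can be sketched with a toy log-structured store. Everything here is hypothetical: the `ToyTablet` class, its method names, and the use of plain lists in place of files. Keys are simplified to single strings and the write-ahead log is never replayed. It illustrates the flow only, not Accumulo's implementation.

```python
import heapq


class ToyTablet:
    def __init__(self):
        self.write_ahead_log = []  # appended on every write, for recovery
        self.in_memory_map = {}    # most recent writes
        self.files = []            # sorted (key, value) lists, oldest first

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.in_memory_map[key] = value

    def minor_compact(self):
        """Flush the in-memory map to a new sorted 'file'."""
        if self.in_memory_map:
            self.files.append(sorted(self.in_memory_map.items()))
            self.in_memory_map = {}
            self.write_ahead_log.clear()

    def major_compact(self):
        """Merge all sorted files into one."""
        merged = list(self._merge(self.files))
        self.files = [merged] if merged else []

    @staticmethod
    def _merge(sources):
        # Tag each entry with the negated age of its source so that, for
        # equal keys, the entry from the newest source comes out first.
        tagged = [
            [(key, -age, value) for key, value in source]
            for age, source in enumerate(sources)
        ]
        previous = object()  # sentinel that equals no key
        for key, _, value in heapq.merge(*tagged):
            if key != previous:
                yield key, value
                previous = key

    def scan(self):
        """Merge the files and the in-memory map into one sorted view."""
        newest = sorted(self.in_memory_map.items())
        return list(self._merge(self.files + [newest]))


tablet = ToyTablet()
tablet.write("a", "1")
tablet.write("b", "2")
tablet.minor_compact()
tablet.write("a", "3")  # newer value, still only in memory
print(tablet.scan())
```

Reads never modify the files: a scan merges every sorted source and lets the newest value win, which is why compactions can run in the background.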


ITERATOR FRAMEWORK
Iterator Operations:
- File Reads
- Block Caching
- Merging
- Deletion
- Isolation
- Locality Groups
- Range Selection
- Column Selection
- Cell-level Security
- Versioning
- Filtering
- Aggregation
- Partitioned Joins
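
Several of these operations can be pictured as small stages stacked over one sorted stream of entries. The sketch below models three of them (versioning, filtering, aggregation) as Python generators. The function names, the `(row, column, timestamp, value)` tuple layout, and the sample data are hypothetical, and this is not Accumulo's iterator interface.

```python
from itertools import groupby


def versioning(entries, max_versions=1):
    """Keep only the newest versions of each (row, column).

    Assumes entries arrive sorted by row, column, then newest timestamp first.
    """
    for _, versions in groupby(entries, key=lambda e: (e[0], e[1])):
        for index, entry in enumerate(versions):
            if index < max_versions:
                yield entry


def filtering(entries, predicate):
    """Pass through only the entries the predicate accepts."""
    return (entry for entry in entries if predicate(entry))


def summing(entries):
    """Aggregate all values of each (row, column) into one total."""
    for (row, column), group in groupby(entries, key=lambda e: (e[0], e[1])):
        yield row, column, sum(entry[3] for entry in group)


entries = [
    ("page1", "clicks", 3, 5),
    ("page1", "clicks", 2, 4),
    ("page1", "views", 1, 10),
    ("page2", "clicks", 1, 7),
]


def is_clicks(entry):
    return entry[1] == "clicks"


# Stack 1: newest version of each cell, then keep the "clicks" column.
latest = list(filtering(versioning(entries), is_clicks))
# Stack 2: keep the "clicks" column, then sum across all versions.
totals = list(summing(filtering(entries, is_clicks)))
print(latest)
print(totals)
```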



CLIENT API
[Diagram: client API object graph]
Instance: new ZooKeeperInstance(...) or new MockInstance()
  getConnector(auth info...) -> Connector
Connector:
  TableOperations, InstanceOperations, SecurityOperations
  createScanner(...) -> Scanner
  createBatchScanner(...) -> BatchScanner
  createBatchWriter(...) -> BatchWriter
Scanner, BatchScanner (inputs: Range, IteratorOption, Authorizations):
  iterator() -> Map.Entry of Key, Value
BatchWriter:
  addMutation(...) takes a Mutation



TABLE DESIGN
No built-in secondary indices
Sort Order  Index
Basic design pattern: forward and inverted index tables
Additional table design patterns
- Graphs
- Document-distributed indexing
- Multi-dimensional index
- Custom index

Table:             Forward Index   Inverted Index
Row:               <UUID>          <Term>
Column Family:     <Type>          <Type> + <Field>
Column Qualifier:  <Field>         <UUID>
Value:             <Term>          <Digest of Event>
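
The forward and inverted index pattern can be sketched with two dictionaries standing in for the two tables. The helper names, the `+` used to join type and field, the sample events, and the empty string stored in place of the event digest are all assumptions for illustration. It is a toy model, not the Accumulo client API.

```python
from bisect import bisect_left


def index_event(uuid, event_type, fields, forward, inverted):
    """Write one event into both toy tables.

    Forward key:  (row=<UUID>, family=<Type>, qualifier=<Field>) -> <Term>
    Inverted key: (row=<Term>, family=<Type>+<Field>, qualifier=<UUID>)
    """
    for field, term in fields.items():
        forward[(uuid, event_type, field)] = term
        # Assumption: the digest value is left empty in this sketch.
        inverted[(term, event_type + "+" + field, uuid)] = ""


def find_uuids(inverted, term):
    """Range-scan the inverted index for all rows equal to the term."""
    keys = sorted(inverted)  # a real table is already kept in sorted order
    start = bisect_left(keys, (term,))
    uuids = []
    for row, _, uuid in keys[start:]:
        if row != term:
            break
        uuids.append(uuid)
    return uuids


forward, inverted = {}, {}
index_event("u1", "person", {"name": "alice", "city": "boston"}, forward, inverted)
index_event("u2", "person", {"name": "bob", "city": "boston"}, forward, inverted)

print(find_uuids(inverted, "boston"))
print(forward[("u1", "person", "name")])
```

A term lookup is a short range scan on the inverted table, and each UUID it returns is a row to fetch from the forward table.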


DEMO TIME!



OPEN SOURCE PROJECT
Apache Software Foundation project since October 2011
site: http://accumulo.apache.org
jira: https://issues.apache.org/jira/browse/ACCUMULO
lists: http://accumulo.apache.org/mailing_list.html



CURRENT CONTRIBUTORS



CONTACT

Adam Fuchs
CTO
adam@sqrrl.com

John Vines
Director of Ecosystems
john@sqrrl.com

sqrrl data, Inc.
www.sqrrl.com
@sqrrl_inc
info@sqrrl.com
