Apache Accumulo Overview

Bill Havanki
Solutions Architect, Cloudera Government Solutions

©2014 Cloudera, Inc. All rights reserved.
Agenda
• Quick History
• Storage Model
• Loading and Querying
• Daemons
• Getting Started, a.k.a. the Pitch

Google BigTable
Compressed, high-performance, scalable,
distributed sorted map

Google BigTable
• Began development in 2004
• Built on Google File System
• Non-relational
• Byte-oriented and schemaless
• Stores data in the petabyte range
• Research paper published in 2006

Child(ren) of BigTable
• Apache HBase (begun 2006, top-level 2010)
• Apache Cassandra (begun 2008-ish, top-level 2010)
• Apache Accumulo ...

From Cloudbase to Accumulo
• Started in 2008 as a National Security Agency project
• Submitted to Apache Incubator in 2011 (and renamed)
• Top-level project in 2012

Key / Value Store
Accumulo stores tables of key / value pairs

Key / Value Store
A row is a sorted sequence of key / value pairs
Each pair is a cell

The Key
row
column: family, qualifier, visibility
timestamp

An example key
row: bhavanki
column: family personal, qualifier middle, visibility PII
timestamp: 1401041295

Another example key
row: brees
column: family employment, qualifier salary, visibility FIN
timestamp: 1401041296
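
The sort order implied by this key layout can be sketched in plain Java. This is a minimal model, not Accumulo's own Key class: the names `CellKey`, `ORDER`, and `KeyOrderSketch` are hypothetical, the values are placeholders, and the sketch assumes that cells differing only in timestamp sort newest first.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical model of a key; not Accumulo's own Key class.
final class CellKey {
    final String row, family, qualifier, visibility;
    final long timestamp;

    CellKey(String row, String family, String qualifier, String visibility, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.visibility = visibility;
        this.timestamp = timestamp;
    }

    // Sort by row, then family, qualifier, visibility, then newest timestamp first.
    static final Comparator<CellKey> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.family.compareTo(b.family);
        if (c == 0) c = a.qualifier.compareTo(b.qualifier);
        if (c == 0) c = a.visibility.compareTo(b.visibility);
        if (c == 0) c = Long.compare(b.timestamp, a.timestamp); // reversed: newest first
        return c;
    };
}

final class KeyOrderSketch {
    // Builds a tiny sorted "table" from the two example keys plus one newer version.
    static TreeMap<CellKey, String> exampleTable() {
        TreeMap<CellKey, String> table = new TreeMap<>(CellKey.ORDER);
        table.put(new CellKey("brees", "employment", "salary", "FIN", 1401041296L), "placeholder-1");
        table.put(new CellKey("bhavanki", "personal", "middle", "PII", 1401041295L), "placeholder-2");
        table.put(new CellKey("bhavanki", "personal", "middle", "PII", 1401041300L), "placeholder-3");
        return table;
    }
}
```

Iterating `KeyOrderSketch.exampleTable()` visits both `bhavanki` cells before `brees`, with the newer `bhavanki` version first.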

It’s all bytes
All key and value data are stored as bytes
except timestamp is a long
There are no built-in data types
but lexicoders help with common types
Key components are usually UTF-8 strings
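
Because everything sorts as bytes, a number written as plain text sorts incorrectly ("10" comes before "9"). The idea behind a lexicoder can be sketched as an order-preserving encoding. This is an illustration only and does not use Accumulo's Lexicoder classes; `SortableLongSketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

final class SortableLongSketch {
    // Flip the sign bit and write big-endian, so unsigned byte order matches numeric order.
    static byte[] encode(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v ^ Long.MIN_VALUE).array();
    }

    // Reverse of encode().
    static long decode(byte[] b) {
        return ByteBuffer.wrap(b).getLong() ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the way a byte-sorted store compares keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}
```

With this encoding, the bytes for -5 sort before the bytes for 3, and the bytes for 9 sort before the bytes for 10.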

Some rows for you
row       cf        cq        cv       ts          value
bhavanki  job       employer           2013-09-01  Cloudera
bhavanki  personal  beer               2013-09-15  Omission
bhavanki  personal  house     NOMUGGL  2014-01-25  Ravenclaw
brees     job       employer           2013-10-01  White Cliffs
brees     personal  house     NOMUGGL  2014-01-01  Hufflepuff

Visibility Labels
Boolean expression
Specialist | (Management & SpecTraining)
Authorizations are provided in each scan
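
How a label expression is checked against a scan's authorizations can be sketched with a small recursive-descent evaluator. This is not Accumulo's own implementation: `VisibilitySketch` and `isVisible` are hypothetical names, and giving & higher precedence than | is a choice of this sketch rather than a statement about Accumulo's parser.

```java
import java.util.Set;

final class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", "");
        this.auths = auths;
    }

    // Returns true if the given authorizations satisfy the label expression.
    static boolean isVisible(String expr, Set<String> auths) {
        if (expr.trim().isEmpty()) return true; // an empty label hides nothing
        VisibilitySketch p = new VisibilitySketch(expr, auths);
        boolean result = p.parseOr();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return result;
    }

    // or := and ('|' and)*
    private boolean parseOr() {
        boolean v = parseAnd();
        while (peek() == '|') { pos++; v |= parseAnd(); } // |= always evaluates the right side
        return v;
    }

    // and := term ('&' term)*
    private boolean parseAnd() {
        boolean v = parseTerm();
        while (peek() == '&') { pos++; v &= parseTerm(); }
        return v;
    }

    // term := '(' or ')' | label
    private boolean parseTerm() {
        if (peek() == '(') {
            pos++;
            boolean v = parseOr();
            if (peek() != ')') throw new IllegalArgumentException("expected ')' at " + pos);
            pos++;
            return v;
        }
        int start = pos;
        while (pos < expr.length() && "|&()".indexOf(expr.charAt(pos)) < 0) pos++;
        if (start == pos) throw new IllegalArgumentException("expected label at " + pos);
        return auths.contains(expr.substring(start, pos));
    }

    private char peek() {
        return pos < expr.length() ? expr.charAt(pos) : '\0';
    }
}
```

For the expression on this slide, a scan holding only `Management` does not see the cell, while a scan holding `Specialist`, or both `Management` and `SpecTraining`, does.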

Locality Groups
You can identify sets of one or more column families as
locality groups
Data in a locality group is stored together for improved
read performance

Tablets
A table is composed of one or more tablets
Table: employees
Tablets: employees;H, employees;S, employees;~
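
Finding the tablet that holds a given row can be sketched as a lookup in a sorted map of split points. This sketch assumes each tablet is named by its inclusive end row and that "~" stands in for the last tablet, as in the diagram; `TabletLookupSketch` and `tabletFor` are hypothetical names, not Accumulo API.

```java
import java.util.Map;
import java.util.TreeMap;

final class TabletLookupSketch {
    // Maps each tablet's end row (assumed inclusive) to the tablet name.
    static final TreeMap<String, String> TABLETS = new TreeMap<>();
    static {
        TABLETS.put("H", "employees;H");
        TABLETS.put("S", "employees;S");
        TABLETS.put("~", "employees;~");
    }

    // The tablet for a row is the one with the smallest end row >= the row.
    static String tabletFor(String row) {
        Map.Entry<String, String> e = TABLETS.ceilingEntry(row);
        // Rows sorting after every end row fall into the last tablet in this sketch.
        return (e != null ? e : TABLETS.lastEntry()).getValue();
    }
}
```

Note that rows sort by byte value, so a lowercase row such as `bhavanki` sorts after both `H` and `S` and lands in the last tablet, while `Brees` lands in `employees;H`.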

Tablets
Tablets map to data files in HDFS
employees;H -> rfile 1
employees;S -> rfile 2
employees;~ -> rfile 3

Tablets
Data is also kept in write-ahead logs and the memtable
employees;H -> rfile 1, walogs, memtable
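
The relationship between the write-ahead log, the memtable, and the data files can be sketched as a simplified write path. This is a toy model under stated assumptions, not Accumulo's implementation: `WritePathSketch`, the size-based flush threshold, and the newest-file-first read are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class WritePathSketch {
    final List<String> walog = new ArrayList<>();                   // stands in for the write-ahead log
    TreeMap<String, String> memtable = new TreeMap<>();             // sorted in-memory map
    final List<TreeMap<String, String>> rfiles = new ArrayList<>(); // stands in for sorted files in HDFS
    final int flushThreshold;                                       // hypothetical flush trigger

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        walog.add(key + "=" + value); // 1. log first, so the write can be replayed after a crash
        memtable.put(key, value);     // 2. then apply it to the memtable
        if (memtable.size() >= flushThreshold) flush();
    }

    // Persist the memtable as a new sorted file; its log entries are then no longer needed.
    void flush() {
        rfiles.add(memtable);
        memtable = new TreeMap<>();
        walog.clear();
    }

    // Reads consult the memtable first, then files from newest to oldest.
    String read(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        for (int i = rfiles.size() - 1; i >= 0; i--) {
            String v = rfiles.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```

With a threshold of 2, the second write triggers a flush: the memtable empties, one file appears, and later reads still find both keys.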

Java Client API

Java Client API
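Read using scanners
// conn is an existing Connector; employeeIds is a collection of row IDs
Scanner s = conn.createScanner("employees", new Authorizations());
s.setRange(new Range("alice", "eve"));      // rows "alice" through "eve"
s.fetchColumnFamily(new Text("personal"));  // only the "personal" column family
for (Entry<Key, Value> e : s)
    employeeIds.add(e.getKey().getRow());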

Java Client API
Read access via iterator pattern
• server-side system iterators handle timestamps,
authorization checks, and lots more
• iterators almost always wrap other iterators, forming a
chain
• you can define your own, client-side or server-side
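
The chaining idea can be sketched with plain `java.util.Iterator` wrappers. This does not use Accumulo's iterator interfaces: `IteratorChainSketch`, `Cell`, `filter`, and `latestVersionOnly` are hypothetical names, and the older version's value is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

final class IteratorChainSketch {
    static final class Cell {
        final String key; final long ts; final String value;
        Cell(String key, long ts, String value) { this.key = key; this.ts = ts; this.value = value; }
    }

    // Wraps a source iterator and passes through only the cells that match the predicate.
    static Iterator<Cell> filter(Iterator<Cell> source, Predicate<Cell> keep) {
        return new Iterator<Cell>() {
            private Cell next = advance();

            private Cell advance() {
                while (source.hasNext()) {
                    Cell c = source.next();
                    if (keep.test(c)) return c;
                }
                return null;
            }

            @Override public boolean hasNext() { return next != null; }

            @Override public Cell next() {
                if (next == null) throw new NoSuchElementException();
                Cell out = next;
                next = advance();
                return out;
            }
        };
    }

    // Keeps only the first cell per key; assumes the source is sorted by key, newest first.
    static Iterator<Cell> latestVersionOnly(Iterator<Cell> source) {
        String[] lastKey = { null };
        return filter(source, c -> {
            boolean firstForKey = !c.key.equals(lastKey[0]);
            lastKey[0] = c.key;
            return firstForKey;
        });
    }

    // Chain: stored cells -> latest version only -> drop one row.
    static List<String> run() {
        List<Cell> stored = Arrays.asList(
            new Cell("bhavanki", 2, "Cloudera"),
            new Cell("bhavanki", 1, "placeholder-older-version"),
            new Cell("brees", 1, "White Cliffs"));
        Iterator<Cell> chain =
            filter(latestVersionOnly(stored.iterator()), c -> !c.key.equals("brees"));
        List<String> out = new ArrayList<>();
        while (chain.hasNext()) out.add(chain.next().value);
        return out;
    }
}
```

Each wrapper sees only what the iterator beneath it emits, which is why the order of the chain matters.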

Java Client API
Scanners fetch sorted rows from one range
Batch scanners fetch unsorted rows from multiple
ranges in parallel
Isolated scanners ensure that you do not see a row
mid-change
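
The difference between a single-range scan and a multi-range batch scan can be sketched against an in-memory sorted map. This is a model only, not the Accumulo client API: `ScanSketch`, `scan`, and `batchScan` are hypothetical names, and the rows other than "alice" and "eve" are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.Queue;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentLinkedQueue;

final class ScanSketch {
    static final NavigableMap<String, String> TABLE = new TreeMap<>();
    static {
        for (String row : Arrays.asList("alice", "bob", "carol", "dave", "eve", "frank")) {
            TABLE.put(row, "value-for-" + row); // placeholder values
        }
    }

    // Single range: results come back in sorted key order.
    static List<String> scan(String start, String end) {
        return new ArrayList<>(TABLE.subMap(start, true, end, true).keySet());
    }

    // Multiple ranges in parallel: results arrive in whatever order the workers finish.
    static List<String> batchScan(List<String[]> ranges) {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        ranges.parallelStream().forEach(r -> results.addAll(scan(r[0], r[1])));
        return new ArrayList<>(results);
    }
}
```

Callers of `batchScan` should treat the result as an unordered collection; only the set of rows is predictable, not their order.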

MapReduce
AccumuloInputFormat
AccumuloOutputFormat

MapReduce
AccumuloRowInputFormat
AccumuloRowOutputFormat

Shell
Command-line / manual access to Accumulo data
• scan, insert, delete
• iterator management
• table management (creation, deletion, cloning)
• user and authorization management
• table splitting and merging
• ... more

Bulk Import
Got lots of data to import quickly?
• Use MR job to format data using
AccumuloFileOutputFormat
• Import files using shell
Trade off latency / availability for throughput

Tablet Server
Serves tablets (table data)
• writes data to walog, memtable; deals with compaction
• serves data for reads from files, memtable
• handles recovery from walogs in case of server failure
Most client calls go to tablet servers

Master
• assigns tablets to tablet servers
• detects tablet server failures and reassigns tablets
• balances tablet assignments over time
• coordinates table operations
Multiple masters are supported for failover, but only one is active at a time
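
Tablet assignment and reassignment after a server failure can be sketched with a least-loaded placement rule. This is a toy model and not Accumulo's actual balancer: `AssignmentSketch`, its methods, and the server names used with it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AssignmentSketch {
    // tablet name -> tablet server currently hosting it
    final Map<String, String> assignments = new TreeMap<>();
    final List<String> liveServers = new ArrayList<>();

    // Pick the live server hosting the fewest tablets (ties broken by list order).
    private String leastLoaded() {
        if (liveServers.isEmpty()) throw new IllegalStateException("no live tablet servers");
        String best = null;
        long bestCount = Long.MAX_VALUE;
        for (String s : liveServers) {
            long count = assignments.values().stream().filter(s::equals).count();
            if (count < bestCount) { best = s; bestCount = count; }
        }
        return best;
    }

    void assign(String tablet) {
        assignments.put(tablet, leastLoaded());
    }

    // On failure, drop the server and hand each of its tablets to a surviving server.
    void serverFailed(String server) {
        liveServers.remove(server);
        List<String> orphaned = new ArrayList<>();
        for (Map.Entry<String, String> e : assignments.entrySet()) {
            if (e.getValue().equals(server)) orphaned.add(e.getKey());
        }
        for (String tablet : orphaned) assign(tablet);
    }
}
```

With three servers and three tablets, each server hosts one tablet; when one server fails, its tablet moves to a survivor and no assignment points at the failed server.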

Everybody Else in Accumulo
Garbage Collector (GC) - identifies and deletes files in
HDFS that are no longer needed
Tracer - listens for and stores distributed trace messages
using a special table

Everybody Else in Accumulo
• Monitor - collects and serves status information
  • server status
  • log inspection
  • performance data
  • table inspection

Everybody Else outside Accumulo
• HDFS (as part of Apache Hadoop)
  • stores tablet files
  • stores write-ahead logs (1.5+)
• MapReduce (Hadoop)
  • bulk import
  • batch processing
• Apache ZooKeeper

Getting Started
a.k.a. the Pitch

Easy as 1-2-3?
1. Install Hadoop (HDFS and MapReduce)
2. Install ZooKeeper
3. Install Accumulo!

Making Steps 1 and 2 Easier
Use a complete, pre-packaged Hadoop distribution
... like CDH!
a leading commercial distribution centered on Apache
Hadoop
• many ecosystem components
• configured / updated to work together

Making Steps 1 and 2 Easier
Cloudera Manager
• deployment
• configuration
• operation
• security

Making Step 3 Easier
Standard Apache Accumulo installation is via tarball
• no longer shipping RPM / DEB / ...
Using CDH/CM you can use:
• a tarball, RPM or DEB with Accumulo packaged for CDH
• a parcel (like RPM / ZIP) for easier upgrades
• 1.4.4 and 1.4.5 available now
• 1.6.0 soon

Where to Go for More
• http://accumulo.apache.org/
• http://www.cloudera.com/content/cloudera/en/products-and-service
• http://www.cloudera.com/content/cloudera/en/products-and-service
• http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/accumulo.html

Accumulo Summit
Join us on June 12

Quick Thanks
• My slide reviewers
  • Sean Busbey
  • Mike Drob
• Accumulo community
• You all for listening

Thank you!
Bill Havanki
bhavanki@clouderagovt.com
