Accumulo in the Cloud
Amazon’s EC2
Apache Accumulo
BigTable
Java
Apache Hadoop
Advanced Features
Cell-level security
Row   Column                  Visibility         Time Stamp   Value
Joe   photos:vacation         family | friends   -            photo425.jpg
Joe   photos:expo             work               -            photo648.jpg
Joe   photos:bachelor_party   friends            -            photo772.jpg
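As a sketch of how rows like these are written, each cell carries its own ColumnVisibility expression. The table name ("photos") and the BatchWriter settings below are illustrative assumptions, using the Accumulo 1.4 Java client:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class VisibilityWriteExample {
    // Write Joe's photo cells, each protected by its own visibility expression.
    public static void writePhotos(Connector connector) throws Exception {
        // hypothetical table name; 1MB buffer, 10s max latency, 2 send threads
        BatchWriter writer = connector.createBatchWriter("photos", 1000000L, 10000L, 2);

        Mutation m = new Mutation(new Text("Joe"));
        m.put(new Text("photos"), new Text("vacation"),
                new ColumnVisibility("family|friends"), new Value("photo425.jpg".getBytes()));
        m.put(new Text("photos"), new Text("expo"),
                new ColumnVisibility("work"), new Value("photo648.jpg".getBytes()));
        m.put(new Text("photos"), new Text("bachelor_party"),
                new ColumnVisibility("friends"), new Value("photo772.jpg".getBytes()));

        writer.addMutation(m);
        writer.close();
    }
}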
Scanning with the FRIENDS authorization returns only the cells whose visibility expression is satisfied by "friends":

Row   Column                  Visibility         Time Stamp   Value
Joe   photos:vacation         family | friends   -            photo425.jpg
Joe   photos:bachelor_party   friends            -            photo772.jpg
Scanning with the FAMILY authorization:

Row   Column                  Visibility         Time Stamp   Value
Joe   photos:vacation         family | friends   -            photo425.jpg
Scanning with the WORK authorization:

Row   Column                  Visibility         Time Stamp   Value
Joe   photos:expo             work               -            photo648.jpg
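The three filtered views above are just scans with different Authorizations. A minimal sketch against the same assumed "photos" table; the scanning user must already have been granted the labels it presents:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class VisibilityScanExample {
    // Scan as a user presenting only the "friends" authorization; cells whose
    // visibility expression is not satisfied (e.g. "work") never leave the server.
    public static void scanAsFriend(Connector connector) throws Exception {
        Scanner scanner = connector.createScanner("photos", new Authorizations("friends"));
        for (Entry<Key, Value> entry : scanner) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}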
Combiners:
Server-side Computation
Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         20120305     100
storeA   sales:shoes     acct         20120303     550
storeA   sales:shoes     acct         20120302     300
storeA   sales:cameras   acct         20120305     100
storeA   sales:cameras   acct         20120303     200

COMBINER: SUM()
Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         -            950
storeA   sales:cameras   acct         -            300

COMBINER: SUM()
Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         -            950
storeA   sales:cameras   acct         -            300

New insert:

Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         20120306     150

COMBINER: SUM()
Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         -            1100
storeA   sales:cameras   acct         -            300

COMBINER: SUM()
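One way to get this behavior is to attach a SummingCombiner to the table so the sum happens server-side. The table name, iterator priority, and string encoding below are illustrative assumptions (Accumulo 1.4 combiner API):

import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class CombinerSetupExample {
    // Attach a SummingCombiner to the "sales" table so multiple versions of a
    // cell (e.g. the sales:shoes entries) are summed at scan and compaction time.
    public static void attachSum(Connector connector) throws Exception {
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
        // combine every column in the "sales" column family
        SummingCombiner.setColumns(setting,
                Collections.singletonList(new IteratorSetting.Column("sales")));
        // values are stored as decimal strings ("100", "550", ...)
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
        connector.tableOperations().attachIterator("sales", setting);
    }
}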
Combiners execute
At query time
Asynchronously in the background
Inserts / updates are fast
Tens of thousands per second per server
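Those write rates rely on batching on the client side as well: the BatchWriter buffers mutations and streams them to tablet servers over several threads. A hedged sketch; the buffer size, latency, thread count, and "sales" table are assumptions:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class HighThroughputWriteExample {
    // Buffer many small mutations and let the BatchWriter flush them in
    // large, parallel batches to the tablet servers.
    public static void writeSales(Connector connector) throws Exception {
        BatchWriter writer = connector.createBatchWriter("sales",
                50 * 1024 * 1024L,  // 50MB client-side buffer
                60 * 1000L,         // flush at least once a minute
                4);                 // send threads

        for (int i = 0; i < 100000; i++) {
            Mutation m = new Mutation(new Text("storeA"));
            m.put(new Text("sales"), new Text("item" + i), new Value("1".getBytes()));
            writer.addMutation(m);
        }
        writer.close();
    }
}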
[Chart: WRITE PERFORMANCE, inserts and updates per second, Accumulo vs. leading brand, scale 0 to 30,000]
Amazon
Elastic Compute Cloud
Really fast provisioning
On-demand hardware
Scale up/down in minutes
Independence
  DIFFERENT MACHINES
  DATA CENTERS
  GEOGRAPHIC LOCATIONS
Spot Instances
Set max price for instances
They’re yours until you’re outbid
SPOT INSTANCE PRICING HISTORY
Are Big Data and Cloud a
match made in heaven?
MATCH

Big Data: Always need more hardware, not always easy to predict
Cloud: Add more machines as needed
MATCH

Big Data: Need so much hardware, have to use lots of commodity machines and fault-tolerant software
Cloud: Sells exclusively commodity virtual servers that blow up occasionally
MATCH

Big Data: Having lots of independent I/O (IOPs and bandwidth) is important, both for MapReduce and servicing requests
Cloud: Offers disks local to the instance
MISMATCH

Big Data: Having lots of independent I/O (IOPs and bandwidth) is important, both for MapReduce and servicing requests
Cloud: Offers shared storage over ethernet (SAN)
MISMATCH

Big Data: Large volumes of data have ‘mass’, get harder to move around
Cloud: Create an entire cluster from nothing, bring all the data in, process, write it all out
MISMATCH, BUT OK

Big Data: Software benefits from lots of independent hardware, and using all the hardware
Cloud: Heavy reliance on virtualization
Scalable administration and elasticity
When a machine fails
Is admin intervention required?
Will clients see exceptions?
When adding a machine
Must data move before it can service requests?
Must processes be restarted?
Accumulo
FAILOVER AUTOMATIC
RE-REPLICATION AUTOMATIC
DATA REPLICATED ASYNCHRONOUSLY
REQUEST-LOAD BALANCED INDEPENDENTLY
NEW MACHINES DISCOVERED AUTOMATICALLY
Some Other NoSQL dbs ...
NOT SO MUCH
Your DB needs to scale in
the human dimension too.
Accumulo in EC2
Instance Types
64-bit machines
no 32-bit t1.micros,
m1.small, or m1.medium
about 2-4GB RAM / core
about 2 disks / core
m1.larges
7.5GB RAM for 4 cores*
            *EC2 COMPUTE UNITS
2 x 420GB disks
m1.xlarge
15GB RAM for 8 cores
4 x 414GB disks
m2.xlarge
17.1GB RAM for 6.5 cores
only 1 x 420GB disk
Bigger instances have
more RAM, CPU ...
Not more disks
EBS is an option ...
RAID is not necessary
Lose some independence
Exceptions
HDFS NameNode
Might get a machine with
more memory
68GB is the current max
Use RAID across multiple
NameNode disks
Where to place instances?
Region = geographic area
Availability Zone = Data Center
Can span multiple AZs
Tested on four AZs in the US East region
Spanning regions?
Cross-site/WAN replication
AMIs
Ubuntu
Maverick (10.10) x86_64
CentOS
Cloudera Hadoop
Cloudera’s Whirr
HTTPS://CCP.CLOUDERA.COM/DISPLAY/CDHDOC/WHIRR+INSTALLATION
OS Config
No swapping
Swappiness = 0
Up open file limits to 64k
Software
Hadoop
Only need HDFS
MapReduce is fine though
Version 0.20
Cloudera CDH3u2
or MapR!
ZooKeeper
Version 3.3.1 or greater
Java
Sun Java JDK 1.6
OpenJDK 1.6
Accumulo 1.3.5 or 1.4.0
wget
HTTP://INCUBATOR.APACHE.ORG/ACCUMULO/DOWNLOADS
Configuration
Use internal IP addresses
for important machines
HDFS NameNode
Accumulo Master
MapReduce JobTracker
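Client code then reaches the cluster through those internal names as well. A minimal connection sketch; the instance name, ZooKeeper hostname, and credentials are made-up placeholders:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        // internal EC2 hostname of a ZooKeeper node (placeholder)
        Instance instance = new ZooKeeperInstance("accumulo-ec2",
                "ip-10-0-0-5.ec2.internal:2181");
        Connector connector = instance.getConnector("root", "secret".getBytes());
        System.out.println(connector.tableOperations().list());
    }
}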
EC2 Security Groups
a.k.a. the Firewall

Ports:
2181, 2888, 3888   ZOOKEEPER
4560               ACCUMULO MONITOR
9000               HDFS
9001               JOBTRACKER
9997               TABLET SERVER
9999               MASTER SERVER
11224              ACCUMULO LOGGER
12234              ACCUMULO TRACER
50010              DATANODE DATA
50020              DATANODE METADATA
50060              TASKTRACKERS
50070              NAMENODE HTTP MONITOR
50075              DATANODE HTTP MONITOR
50091              ACCUMULO GC
50095              ACCUMULO HTTP MONITOR
Create SSH key on
Accumulo Master
Distribute to tablet servers
Follow ordinary Hadoop
and Accumulo config steps
Scaling Up
Provision new instances
Only need to be identically
configured
Could store config files
into an AMI
Start HDFS data node
Start Accumulo loggers
and Tablet Server
Tablet servers assigned
tablets immediately
HDFS blocks can be
rebalanced
Some new blocks will go
to new machines
Scaling Up
[Chart: aggregate write rate for clusters of 20, 40, 100, 200, and 400 nodes]
ABOUT 85% INCREASE IN WRITE RATE EACH TIME CLUSTER SIZE DOUBLED
HIT 1 MILLION WRITES PER SECOND AT 100 M1.LARGES WITH 50 CLIENTS
m1.large NameNode
served a 400-node cluster
Running ‘at scale’
Some machines will be
lost
Everything keeps working
Accumulo auto-recovers
using write-ahead logs
Occasionally provision
replacement machines
On your schedule
Clients see no exceptions /
errors
Watch writes and reads
scale on monitor page
 HTTP://ACCUMULO-MONITOR:50095
See a list of failed
machines too
How many machines?
May need up to 2x VMs vs.
bare metal
Scaling Down
Identify a set of machines
to remove
Stop tablet servers,
loggers
accumulo admin stop <node>
        IN VERSION 1.4
Decommission HDFS
 datanodes
HTTP://DEVELOPER.YAHOO.COM/HADOOP/TUTORIAL/MODULE2.HTML
NameNode will
re-replicate blocks
Remaining machines have
enough storage?
Can lower the replication factor if necessary, as sketched below
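A hedged sketch of lowering replication on existing Accumulo files via the Hadoop FileSystem API; the /accumulo path and target factor of 2 are just examples, and hadoop fs -setrep -R does the same from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplicationExample {
    // Walk /accumulo and drop each file's replication factor to 2 so the
    // remaining datanodes have room for re-replicated blocks.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        setReplicationRecursive(fs, new Path("/accumulo"), (short) 2);
    }

    private static void setReplicationRecursive(FileSystem fs, Path path, short replication)
            throws Exception {
        for (FileStatus status : fs.listStatus(path)) {
            if (status.isDir()) {
                setReplicationRecursive(fs, status.getPath(), replication);
            } else {
                fs.setReplication(status.getPath(), replication);
            }
        }
    }
}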
When decommissioning is
done, terminate instances
Scale Up again!
Repeat.
Details
   HTTP://WWW.ACCUMULODATA.COM/EC2.HTML
Questions

Accumulo on EC2