Accumulo in the Cloud
Amazon’s EC2
Apache Accumulo
BigTable
Java
Apache Hadoop
Advanced Features
Cell-level security
Row   Column                  Visibility         Time Stamp   Value
Joe   photos:vacation         family | friends   -            photo425.jpg
Joe   photos:expo             work               -            photo648.jpg
Joe   photos:bachelor_party   friends            -            photo772.jpg
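As a sketch of how rows like these are written, each cell carries its own ColumnVisibility expression. The table name ("photos") and the BatchWriter settings below are illustrative assumptions, using the Accumulo 1.4 Java client:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class VisibilityWriteExample {
    // Write Joe's photo cells, each protected by its own visibility expression.
    public static void writePhotos(Connector connector) throws Exception {
        // hypothetical table name; 1MB buffer, 10s max latency, 2 send threads
        BatchWriter writer = connector.createBatchWriter("photos", 1000000L, 10000L, 2);

        Mutation m = new Mutation(new Text("Joe"));
        m.put(new Text("photos"), new Text("vacation"),
                new ColumnVisibility("family|friends"), new Value("photo425.jpg".getBytes()));
        m.put(new Text("photos"), new Text("expo"),
                new ColumnVisibility("work"), new Value("photo648.jpg".getBytes()));
        m.put(new Text("photos"), new Text("bachelor_party"),
                new ColumnVisibility("friends"), new Value("photo772.jpg".getBytes()));

        writer.addMutation(m);
        writer.close();
    }
}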
Scanning with the FRIENDS authorization returns only the cells whose visibility expression is satisfied by "friends":

Row   Column                  Visibility         Time Stamp   Value
Joe   photos:vacation         family | friends   -            photo425.jpg
Joe   photos:bachelor_party   friends            -            photo772.jpg
Scanning with the FAMILY authorization:

Row   Column                  Visibility         Time Stamp   Value
Joe   photos:vacation         family | friends   -            photo425.jpg
Scanning with the WORK authorization:

Row   Column                  Visibility         Time Stamp   Value
Joe   photos:expo             work               -            photo648.jpg
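The three filtered views above are just scans with different Authorizations. A minimal sketch against the same assumed "photos" table; the scanning user must already have been granted the labels it presents:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class VisibilityScanExample {
    // Scan as a user presenting only the "friends" authorization; cells whose
    // visibility expression is not satisfied (e.g. "work") never leave the server.
    public static void scanAsFriend(Connector connector) throws Exception {
        Scanner scanner = connector.createScanner("photos", new Authorizations("friends"));
        for (Entry<Key, Value> entry : scanner) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}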
Combiners:
Server-side Computation
Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         20120305     100
storeA   sales:shoes     acct         20120303     550
storeA   sales:shoes     acct         20120302     300
storeA   sales:cameras   acct         20120305     100
storeA   sales:cameras   acct         20120303     200

COMBINER: SUM()
Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         -            950
storeA   sales:cameras   acct         -            300

COMBINER: SUM()
Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         -            950
storeA   sales:cameras   acct         -            300

New insert:

Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         20120306     150

COMBINER: SUM()
Row      Column          Visibility   Time Stamp   Value
storeA   sales:shoes     acct         -            1100
storeA   sales:cameras   acct         -            300

COMBINER: SUM()
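One way to get this behavior is to attach a SummingCombiner to the table so the sum happens server-side. The table name, iterator priority, and string encoding below are illustrative assumptions (Accumulo 1.4 combiner API):

import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class CombinerSetupExample {
    // Attach a SummingCombiner to the "sales" table so multiple versions of a
    // cell (e.g. the sales:shoes entries) are summed at scan and compaction time.
    public static void attachSum(Connector connector) throws Exception {
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
        // combine every column in the "sales" column family
        SummingCombiner.setColumns(setting,
                Collections.singletonList(new IteratorSetting.Column("sales")));
        // values are stored as decimal strings ("100", "550", ...)
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
        connector.tableOperations().attachIterator("sales", setting);
    }
}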
Combiners execute
At query time
Asynchronously in the background
Inserts / updates are fast
Tens of thousands per second per server
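Those write rates rely on batching on the client side as well: the BatchWriter buffers mutations and streams them to tablet servers over several threads. A hedged sketch; the buffer size, latency, thread count, and "sales" table are assumptions:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class HighThroughputWriteExample {
    // Buffer many small mutations and let the BatchWriter flush them in
    // large, parallel batches to the tablet servers.
    public static void writeSales(Connector connector) throws Exception {
        BatchWriter writer = connector.createBatchWriter("sales",
                50 * 1024 * 1024L,  // 50MB client-side buffer
                60 * 1000L,         // flush at least once a minute
                4);                 // send threads

        for (int i = 0; i < 100000; i++) {
            Mutation m = new Mutation(new Text("storeA"));
            m.put(new Text("sales"), new Text("item" + i), new Value("1".getBytes()));
            writer.addMutation(m);
        }
        writer.close();
    }
}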
[Chart: WRITE PERFORMANCE, inserts and updates per second, Accumulo vs. leading brand, scale 0 to 30,000]
Amazon
Elastic Compute Cloud
Really fast provisioning
On-demand hardware
Scale up/down in minutes
Independence
  DIFFERENT MACHINES
  DATA CENTERS
  GEOGRAPHIC LOCATIONS
Spot Instances
Set max price for instances
They’re yours until you’re outbid
SPOT INSTANCE PRICING HISTORY
Are Big Data and Cloud a
match made in heaven?
MATCH

Big Data: Always need more hardware, not always easy to predict
Cloud: Add more machines as needed
MATCH

Big Data: Need so much hardware, have to use lots of commodity machines and fault-tolerant software
Cloud: Sells exclusively commodity virtual servers that blow up occasionally
MATCH

Big Data: Having lots of independent I/O (IOPs and bandwidth) is important, both for MapReduce and servicing requests
Cloud: Offers disks local to the instance
MISMATCH

Big Data: Having lots of independent I/O (IOPs and bandwidth) is important, both for MapReduce and servicing requests
Cloud: Offers shared storage over ethernet (SAN)
MISMATCH

Big Data: Large volumes of data have ‘mass’, get harder to move around
Cloud: Create an entire cluster from nothing, bring all the data in, process, write it all out
MISMATCH, BUT OK

Big Data: Software benefits from lots of independent hardware, and using all the hardware
Cloud: Heavy reliance on virtualization
Scalable administration and elasticity
When a machine fails
Is admin intervention required?
Will clients see exceptions?
When adding a machine
Must data move before it can service requests?
Must processes be restarted?
Accumulo
FAILOVER AUTOMATIC
RE-REPLICATION AUTOMATIC
DATA REPLICATED ASYNCHRONOUSLY
REQUEST-LOAD BALANCED INDEPENDENTLY
NEW MACHINES DISCOVERED AUTOMATICALLY
Some Other NoSQL dbs ...
NOT SO MUCH
Your DB needs to scale in
the human dimension too.
Accumulo in EC2
Instance Types
64-bit machines
no 32-bit t1.micros,
m1.small, or m1.medium
about 2-4GB RAM / core
about 2 disks / core
m1.larges
7.5GB RAM for 4 cores*
            *EC2 COMPUTE UNITS
2 x 420GB disks
m1.xlarge
15GB RAM for 8 cores
4 x 414GB disks
m2.xlarge
17.1GB RAM for 6.5 cores
only 1 x 420GB disk
Bigger instances have
more RAM, CPU ...
Not more disks
EBS is an option ...
RAID is not necessary
Lose some independence
Exceptions
HDFS NameNode
Might get a machine with
more memory
68GB is the current max
Use RAID across multiple
NameNode disks
Where to place instances?
Region = geographic area
Availability Zone = Data Center
Can span multiple AZs
Tested on four AZs in the US East region
Spanning regions?
Cross-site/WAN replication
AMIs
Ubuntu
Maverick (10.10) x86_64
CentOS
Cloudera Hadoop
Cloudera’s Whirr
HTTPS://CCP.CLOUDERA.COM/DISPLAY/CDHDOC/WHIRR+INSTALLATION
OS Config
No swapping
Swappiness = 0
Up open file limits to 64k
Software
Hadoop
Only need HDFS
MapReduce is fine though
Version 0.20
Cloudera CDH3u2
or MapR!
ZooKeeper
Version 3.3.1 or greater
Java
Sun Java JDK 1.6
OpenJDK 1.6
Accumulo 1.3.5 or 1.4.0
wget
HTTP://INCUBATOR.APACHE.ORG/ACCUMULO/DOWNLOADS
Configuration
Use internal IP addresses
for important machines
HDFS NameNode
Accumulo Master
MapReduce JobTracker
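Client code then reaches the cluster through those internal names as well. A minimal connection sketch; the instance name, ZooKeeper hostname, and credentials are made-up placeholders:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        // internal EC2 hostname of a ZooKeeper node (placeholder)
        Instance instance = new ZooKeeperInstance("accumulo-ec2",
                "ip-10-0-0-5.ec2.internal:2181");
        Connector connector = instance.getConnector("root", "secret".getBytes());
        System.out.println(connector.tableOperations().list());
    }
}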
EC2 Security Groups
a.k.a. the Firewall

Ports:
2181, 2888, 3888   ZOOKEEPER
4560               ACCUMULO MONITOR
9000               HDFS
9001               JOBTRACKER
9997               TABLET SERVER
9999               MASTER SERVER
11224              ACCUMULO LOGGER
12234              ACCUMULO TRACER
50010              DATANODE DATA
50020              DATANODE METADATA
50060              TASKTRACKERS
50070              NAMENODE HTTP MONITOR
50075              DATANODE HTTP MONITOR
50091              ACCUMULO GC
50095              ACCUMULO HTTP MONITOR
Create SSH key on
Accumulo Master
Distribute to tablet servers
Follow ordinary Hadoop
and Accumulo config steps
Scaling Up
Provision new instances
Only need to be identically
configured
Could store config files
into an AMI
Start HDFS data node
Start Accumulo loggers
and Tablet Server
Tablet servers assigned
tablets immediately
HDFS blocks can be
rebalanced
Some new blocks will go
to new machines
Scaling Up
[Chart: aggregate write rate for clusters of 20, 40, 100, 200, and 400 nodes]
ABOUT 85% INCREASE IN WRITE RATE EACH TIME CLUSTER SIZE DOUBLED
HIT 1 MILLION WRITES PER SECOND AT 100 M1.LARGES WITH 50 CLIENTS
m1.large NameNode
served a 400-node cluster
Running ‘at scale’
Some machines will be
lost
Everything keeps working
Accumulo auto-recovers
using write-ahead logs
Occasionally provision
replacement machines
On your schedule
Clients see no exceptions /
errors
Watch writes and reads
scale on monitor page
 HTTP://ACCUMULO-MONITOR:50095
See a list of failed
machines too
How many machines?
May need up to 2x VMs vs.
bare metal
Scaling Down
Identify a set of machines
to remove
Stop tablet servers,
loggers
accumulo admin stop <node>
        IN VERSION 1.4
Decommission HDFS
 datanodes
HTTP://DEVELOPER.YAHOO.COM/HADOOP/TUTORIAL/MODULE2.HTML
NameNode will
re-replicate blocks
Remaining machines have
enough storage?
Can lower the replication factor if necessary, as sketched below
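A hedged sketch of lowering replication on existing Accumulo files via the Hadoop FileSystem API; the /accumulo path and target factor of 2 are just examples, and hadoop fs -setrep -R does the same from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplicationExample {
    // Walk /accumulo and drop each file's replication factor to 2 so the
    // remaining datanodes have room for re-replicated blocks.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        setReplicationRecursive(fs, new Path("/accumulo"), (short) 2);
    }

    private static void setReplicationRecursive(FileSystem fs, Path path, short replication)
            throws Exception {
        for (FileStatus status : fs.listStatus(path)) {
            if (status.isDir()) {
                setReplicationRecursive(fs, status.getPath(), replication);
            } else {
                fs.setReplication(status.getPath(), replication);
            }
        }
    }
}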
When decommissioning is
done, terminate instances
Scale Up again!
Repeat.
Details
   HTTP://WWW.ACCUMULODATA.COM/EC2.HTML
Questions

Accumulo on EC2