Four Orders of Magnitude:
Running Large Scale
Accumulo Clusters
Aaron Cordova
Accumulo Summit, June 2014
Scale, Security, Schema
Scale
to scale¹ - (vt) to change the size
of something
“let’s scale the cluster up to
twice the original size”
to scale² - (vi) to function
properly at a large scale
“Accumulo scales”
What is Large Scale?
Notebook Computer
• 16 GB DRAM
• 512 GB Flash Storage
• 2.3 GHz quad-core i7 CPU
Modern Server
• 100s of GB DRAM
• 10s of TB on disk
• 10s of cores
Large Scale
Chart: data volumes from 10 GB to 100 PB, in RAM and on disk, across a laptop, a single server, and clusters of 10, 100, 1,000, and 10,000 nodes
Data Composition
Chart: data composition growing month over month (January through April): original raw data, derivatives, query-focused datasets (QFDs), and indexes
Accumulo Scales
• From GB to PB, Accumulo keeps two things low:
• Administrative effort
• Scan latency
Scan Latency
Chart: scan latency in seconds (0 to 0.05) versus cluster size (0 to 1,000 nodes); latency stays low as the cluster grows
Administrative Overhead
Chart: failed machines versus administrative interventions (0 to 12) as the cluster grows from 0 to 1,000 nodes
Accumulo Scales
• From GB to PB, three things grow linearly:
• Total storage size
• Ingest Rate
• Concurrent scans
Ingest Benchmark
Chart: ingest rate in millions of entries per second (0 to 100) versus cluster size (0 to 1,000 machines), from the AWB benchmark
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf
1000 machines
100 M entries written per second
408 terabytes
7.56 trillion total entries
Graph Benchmark
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
1200 machines
4.4 trillion vertices
70.4 trillion edges
149 M edges traversed per
second
1 petabyte
Graph Analysis
Chart: graph sizes in billions of edges: Twitter 1.5, Yahoo! 6.6, Facebook 1,000, Accumulo 70,000
Accumulo is designed after
Google’s BigTable
BigTable powers hundreds of
applications at Google
BigTable serves 2+ exabytes
http://hbasecon.com/sessions/#session33
600 M queries per second
organization wide
From 10 to 10,000
Starting with ten machines
10^1
One rack
1 TB RAM
10-100 TB Disk
Hardware failures rare
Test Application Designs
Designing Applications for Scale
Keys to Scaling
1. Live writes go to all servers
2. User requests are satisfied by few scans
3. Turn updates into inserts
Keys to Scaling
Writes on all servers Few Scans
Hash / UUID Keys
RowID Col Value
af362de4 Bob
b23dc4be Annie
b98de2ff Joe
c48e2ade $30
c7e43fb2 $25
d938ff3d 32
e2e4dac4 59
e98f2eab3 43
Key Value
userA:name Bob
userA:age 43
userA:account $30
userB:name Annie
userB:age 32
userB:account $25
userC:name Joe
userC:age 59
Uniform writes
Monitor
Participating Tablet Servers
MyTable
Servers Hosted Tablets … Ingest
r1n1 1500 200k
r1n2 1501 210k
r2n1 1499 190k
r2n2 1500 200k
Hash / UUID Keys
RowID Col Value
af362de4 Bob
b23dc4be Annie
b98de2ff Joe
c48e2ade $30
c7e43fb2 $25
d938ff3d 32
e2e4dac4 59
e98f2eab3 43
3 x 1-entry scans on 3 servers
get(userA)
Keys to Scaling
Writes on all servers Few Scans
Hash / UUID Keys
Group for Locality
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
userA:name Bob
userA:age 43
userA:account $30
userB:name Annie
userB:age 32
userB:account $25
userC:name Joe
userC:age 59
RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
Still fairly uniform writes
Group for Locality
RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
1 x 3-entry scan on 1 server
get(userA)
Keys to Scaling
Writes on all servers Few Scans
Grouped Keys
Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
RowID Col Value
20140101 44
20140102 22
20140103 23
Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
Writes always go to one server
No write parallelism
Temporal Keys
RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
Fetching ranges uses few scans
get(20140101 to 201404)
Keys to Scaling
Writes on all servers Few Scans
Temporal Keys
Binned Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
RowID Col Value
0_20140101 44
1_20140102 22
2_20140103 23
Uniform Writes
Binned Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
RowID Col Value
0_20140101 44
0_20140104 25
1_20140102 22
1_20140105 31
2_20140103 23
2_20140106 27
Uniform Writes
Binned Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
RowID Col Value
0_20140101 44
0_20140104 25
0_20140107 25
1_20140102 22
1_20140105 31
1_20140108 17
2_20140103 23
2_20140106 27
Uniform Writes
Binned Temporal Keys
RowID Col Value
0_20140101 44
0_20140104 25
0_20140107 25
1_20140102 22
1_20140105 31
1_20140108 17
2_20140103 23
2_20140106 27
One scan per bin
get(20140101 to 201404)
Keys to Scaling
Writes on all servers Few Scans
Binned Temporal Keys
Keys to Scaling
• Key design is critical
• Group data under common row IDs to reduce
scans
• Prepend bins to row IDs to increase write
parallelism (see the sketch below)
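A minimal sketch of the binned temporal key design in Java, using the standard Accumulo client API. The table name, the "count" column family, the three-bin count, and hash-based bin assignment are all assumptions for illustration; the tables above assign bins round-robin instead.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;

public class BinnedTemporalWriter {
  static final int NUM_BINS = 3;  // assumption: three bins, as in the tables above

  // Prefix the date with a bin so consecutive dates land on different tablets/servers
  static String binnedRow(String yyyymmdd) {
    int bin = Math.abs(yyyymmdd.hashCode()) % NUM_BINS;
    return bin + "_" + yyyymmdd;  // e.g. "1_20140102"
  }

  static void writeDailyCount(Connector conn, String date, long count) throws Exception {
    BatchWriter writer = conn.createBatchWriter("events", new BatchWriterConfig());
    Mutation m = new Mutation(binnedRow(date));
    m.put("count", "", Long.toString(count));  // hypothetical column family "count"
    writer.addMutation(m);
    writer.close();
  }
}

At read time the client issues one scan per bin over the same date range, matching the "one scan per bin" pattern above.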
Splits
• Pre-split or organic splits
• Going from dev to production, you can ingest a
representative sample, obtain its split points, and use
them to pre-split the larger system (see the sketch below)
• Hundreds or thousands of tablets per server are OK
• Want at least one tablet per server
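A sketch of that dev-to-production step using the TableOperations API; the table names and the 1,000-split cap are hypothetical.

import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class PreSplit {
  static void preSplitFromDev(Connector conn) throws Exception {
    // Split points learned from a representative sample ingested into the dev table
    Collection<Text> devSplits = conn.tableOperations().listSplits("mytable_dev", 1000);

    // Apply the same split points to the (empty) production table before loading
    SortedSet<Text> splits = new TreeSet<Text>(devSplits);
    conn.tableOperations().addSplits("mytable", splits);
  }
}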
Effect of Compression
• Similar sorted keys compress well
• May need more data than you think to auto-split
Inserts are fast
10s of thousands per second per
machine
Updates *can* be …
Update Types
• Overwrite
• Combine
• Complex
Update - Overwrite
• Performance same as insert
• Ignore (don’t read) existing value
• Accumulo’s Versioning Iterator does the overwrite
Update - Overwrite
RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:age -> 34
Update - Overwrite
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:age -> 34
Update - Combine
• Things like X = X + 1
• Normally one would have to read the old value to
do this, but Accumulo iterators allow multiple
inserts to be combined at scan time or at
compaction time (see the sketch below)
• Performance is the same as inserts
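A sketch of the X = X + 1 pattern using the SummingCombiner iterator. The table name, the "account" column, the iterator priority, and the use of plain numeric string values (rather than the "$25"-style values shown in the tables) are assumptions for illustration.

import java.util.Collections;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class CombineUpdates {
  static void setup(Connector conn) throws Exception {
    IteratorSetting is = new IteratorSetting(10, "sumAccount", SummingCombiner.class);
    SummingCombiner.setEncodingType(is, LongCombiner.Type.STRING);  // values stored as decimal strings
    SummingCombiner.setColumns(is,
        Collections.singletonList(new IteratorSetting.Column("account")));
    conn.tableOperations().attachIterator("users", is);  // applied at scan and compaction time
  }

  static void addToBalance(Connector conn, String row, long delta) throws Exception {
    BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());
    Mutation m = new Mutation(row);
    m.put("account", "", Long.toString(delta));  // insert only the delta; the combiner sums on read
    bw.addMutation(m);
    bw.close();
  }
}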
Update - Combine
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:account -> +10
Update - Combine
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
af362de4 account $10
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:account -> +10
Update - Combine
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
af362de4 account $10
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
getAccount(userB)
$35
Update - Combine
After compaction
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
Update - Complex
• Some updates require looking at more data than
iterators have access to - such as data spanning multiple rows
• These require reading the existing data out in order to
write the new value (see the sketch below)
• Performance will be much slower
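A sketch of the read-then-write pattern behind the account example that follows; the table name and column layout are assumptions matching the illustration.

import java.util.Map.Entry;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class ComplexUpdate {
  // Read one user's balance (one scan), assuming plain numeric string values
  static long getBalance(Connector conn, String row) throws Exception {
    Scanner s = conn.createScanner("users", Authorizations.EMPTY);
    s.setRange(Range.exact(row));
    s.fetchColumn(new Text("account"), new Text(""));
    long balance = 0;
    for (Entry<Key, Value> e : s) {
      balance = Long.parseLong(e.getValue().toString());
    }
    return balance;
  }

  // userC:account = getBalance(userA) + getBalance(userB) : two reads plus one write
  static void setCombined(Connector conn, String userA, String userB, String userC) throws Exception {
    long total = getBalance(conn, userA) + getBalance(conn, userB);
    BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());
    Mutation m = new Mutation(userC);
    m.put("account", "", Long.toString(total));
    bw.addMutation(m);
    bw.close();
  }
}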
Update - Complex
userC:account =
getBalance(userA) +
getBalance(userB)
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
c48e2ade account $40
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
35+30 = 65
Update - Complex
userC:account =
getBalance(userA) +
getBalance(userB)
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
c48e2ade account $65
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
35+30 = 65
Planning a Larger-Scale Cluster
10^2 - 10^4
Storage vs Ingest
Chart: total storage in terabytes and ingest rate in millions of entries per second versus cluster size (10 to 10,000 machines), for servers with 1 x 1 TB and 12 x 3 TB disks
Model for Ingest Rates
A = 0.85^(log2 N) * N * S
N - Number of machines
S - Single Server throughput (entries / second)
A - Aggregate Cluster throughput (entries / second)
Expect about 85% efficiency when doubling the cluster:
each doubling yields roughly 1.7x the aggregate write rate
Estimating Machines Required
N = 2^(log2(A/S) / 0.7655347), where 0.7655347 = log2(1.7)
N - Number of machines
S - Single Server throughput (entries / second)
A - Target Aggregate throughput (entries / second)
Expect about 85% efficiency when doubling the cluster:
each doubling yields roughly 1.7x the aggregate write rate
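A small numeric sketch of this model; the per-server rate and the target aggregate rate are hypothetical values chosen only to show the arithmetic.

public class IngestModel {
  // A = 0.85^(log2 N) * N * S : aggregate rate for N servers writing S entries/second each
  static double aggregateRate(int n, double s) {
    double log2n = Math.log(n) / Math.log(2);
    return Math.pow(0.85, log2n) * n * s;
  }

  // N = 2^(log2(A/S) / 0.7655347), where 0.7655347 = log2(1.7)
  static double machinesNeeded(double a, double s) {
    double log2Ratio = Math.log(a / s) / Math.log(2);
    return Math.pow(2, log2Ratio / 0.7655347);
  }

  public static void main(String[] args) {
    double s = 400000;  // hypothetical single-server rate (entries/second)
    System.out.println(aggregateRate(1000, s));        // aggregate rate on 1,000 machines
    System.out.println(machinesNeeded(100000000, s));  // machines needed for a 100M entries/second target
  }
}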
Predicted Cluster Sizes
Chart: predicted number of machines (0 to 12,000) needed to reach a target aggregate ingest rate (0 to 600 million entries per second)
100 Machines
10^2
Multiple racks
10 TB RAM
100 TB - 1PB Disk
Some hardware failures
in the first week
(burn in)
Expect 3 failed HDs in first 3 mo
Another 4 within the first year
http://static.googleusercontent.com/media/
research.google.com/en/us/archive/disk_failures.pdf
Can process the
1000 Genomes data set
260 TB
www.1000genomes.org
Can store and index the
Common Crawl Corpus
2.8 Billion web pages
541 TB
commoncrawl.org
One year of Twitter
182 billion tweets
483 TB
http://www.sec.gov/Archives/edgar/data/
1418091/000119312513390321/d564001ds1.htm
Deploying an Application
Diagram: Users → Clients → Tablet Servers
May not see the effect of writing
to disk for a while
1000 machines
10^3
Multiple rows of racks
100 TB RAM
1-10 PB Disk
Hardware failure is a regular
occurrence
Hard drive failure about every 5 days (average).
Failures will be skewed towards the beginning of
the year
Can traverse the ‘brain graph’
70 trillion edges, 1 PB
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Facebook Graph
1s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/
xldb2012_wed_1105_DhrubaBorthakur.pdf
Netflix Video Master Copies
3.14 PB
http://www.businessweek.com/articles/2013-05-09/netflix-reed-
hastings-survive-missteps-to-join-silicon-valleys-elite
World of Warcraft Backend Storage
1.3 PB
http://www.datacenterknowledge.com/archives/2009/11/25/
wows-back-end-10-data-centers-75000-cores/
Webpages, live on the Internet
14.3 Trillion
http://www.factshunt.com/2014/01/
total-number-of-websites-size-of.html
Things like the difference between
two compression algorithms start
to make a big difference
Use range compactions to effect
changes on portions of a table
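For example, after changing a compression codec or iterator setting, a range compaction rewrites only the affected tablets. The table name and row bounds below are hypothetical.

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class RangeCompact {
  static void compactJanuary(Connector conn) throws Exception {
    // Rewrite only tablets covering rows 20140101 through 20140131,
    // flushing in-memory data first and waiting for the compaction to finish
    conn.tableOperations().compact("events", new Text("20140101"), new Text("20140131"), true, true);
  }
}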
Lay off ZooKeeper
Watch the garbage collector and
NameNode ops
Garbage Collection > 5 minutes?
Start thinking about
NameNode Federation
Accumulo 1.6
Multiple NameNodes
Diagram: Accumulo running over two separate HDFS clusters, each with its own NameNode and DataNodes
Multiple NameNodes
Diagram: Accumulo over multiple NameNodes that share the same DataNodes (HDFS Federation; requires Hadoop 2.0)
More NameNodes = higher risk of
one going down.
Can use HA NameNodes in
conjunction with Federation
10,000 machines
10^4
You, my friend, are here to
kick a** and chew bubble gum
1 PB RAM
10-100 PB Disk
1 hardware failure every hour on
average
Entire Internet Archive
15 PB
http://www.motherjones.com/media/2014/05/
internet-archive-wayback-machine-brewster-kahle
A year’s worth of data from the
Large Hadron Collider
15 PB
http://home.web.cern.ch/about/computing
0.1% of all Internet traffic in 2013
43.6 PB
http://www.factshunt.com/2014/01/
total-number-of-websites-size-of.html
Facebook Messaging Data
10s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/
xldb2012_wed_1105_DhrubaBorthakur.pdf
Facebook Photos
240 billion
High 10s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/
xldb2012_wed_1105_DhrubaBorthakur.pdf
Must use multiple NameNodes
Can tune back heartbeat frequency and
the periodicity of central processes in
general
Can combine multiple PB data
sets
Up to 10 quadrillion entries in a
single table
While maintaining sub-second
lookup times
Only with Accumulo 1.6
Dealing with data over time
Data Over Time - Patterns
• Initial Load
• Increasing Velocity
• Focus on Recency
• Historical Summaries
Initial Load
• Get a pile of old data into Accumulo fast
• Latency not important (data is old)
• Throughput critical
Bulk Load RFiles
Bulk Loading
Diagram: MapReduce → RFiles → Accumulo
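A sketch of the final import step once a MapReduce job using AccumuloFileOutputFormat has written RFiles to HDFS; the paths and table name are hypothetical.

import org.apache.accumulo.core.client.Connector;

public class BulkLoad {
  static void load(Connector conn) throws Exception {
    // /data/bulk/files    : RFiles produced by the MapReduce job
    // /data/bulk/failures : an existing, empty directory for files that fail to import
    // setTime = false     : keep the timestamps written into the RFiles
    conn.tableOperations().importDirectory("mytable", "/data/bulk/files", "/data/bulk/failures", false);
  }
}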
Increasing velocity
If your data isn’t big today,
wait a little while
Accumulo scales up dynamically,
online. No downtime
This is the first sense of "scale": changing size
Scaling Up
Clients
Accumulo
HDFS
3 physical servers
Each running
a Tablet Server process
and a Data Node process
Scaling Up
Clients
Accumulo
HDFS
Start 3 new Tablet Server procs
3 new Data node processes
Scaling Up
Clients
Accumulo
HDFS
master immediately assigns tablets
Scaling Up
Clients
Accumulo
HDFS
Clients immediately
begin querying new
Tablet Servers
Scaling Up
Clients
Accumulo
HDFS
new Tablet Servers read data from old Data nodes
Scaling Up
Clients
Accumulo
HDFS
new Tablet Servers write data to new Data Nodes
Never really seen
anyone do this
Except myself
20 machines in Amazon EC2
to 400 machines
all during the same MapReduce job
reading data out of Accumulo,
summarizing, and writing back
Scaled back down to 20
machines when done
Just killed Tablet Servers
Decommissioned Data Nodes for
safe data consolidation to
remaining 20 nodes
Other ways to go from
10^x to 10^(x+1)
Accumulo Table Export
followed by HDFS DistCp to the new
cluster
Maybe the new replication feature
Newer Data is Read more Often
Accumulo keeps newly written
data in memory
Block Cache can keep recently
queried data in memory
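The data block cache is off by default and can be enabled per table; a sketch with a hypothetical table name:

import org.apache.accumulo.core.client.Connector;

public class EnableBlockCache {
  static void enable(Connector conn) throws Exception {
    // Cache recently read data blocks and index blocks in tablet server memory
    conn.tableOperations().setProperty("users", "table.cache.block.enable", "true");
    conn.tableOperations().setProperty("users", "table.cache.index.enable", "true");
  }
}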
Combining iterators make it easy to
maintain summaries of large
amounts of raw events
This reduces the storage burden
Historical Summaries
Chart: raw events processed versus unique entities stored (0 to 8,000), April through July
Age-off iterator can automatically
remove data over a certain age
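A sketch of attaching the age-off iterator so older entries are dropped at scan and compaction time; the table name, priority, and 90-day TTL are hypothetical.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.user.AgeOffFilter;

public class AgeOff {
  static void configure(Connector conn) throws Exception {
    IteratorSetting is = new IteratorSetting(30, "ageoff", AgeOffFilter.class);
    is.addOption("ttl", Long.toString(90L * 24 * 60 * 60 * 1000));  // 90 days, in milliseconds
    conn.tableOperations().attachIterator("events", is);
  }
}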
IBM estimates 2.5 exabytes of
data is created every day
http://www-01.ibm.com/software/data/bigdata/
what-is-big-data.html
90% of available data created in
last 2 years
http://www-01.ibm.com/software/data/bigdata/
what-is-big-data.html
25 new 10k node Accumulo
clusters per day
Accumulo is doing its part to get
in front of the big data trend
Questions?
@aaroncordova