Four Orders of Magnitude:
Running Large Scale
Accumulo Clusters
Aaron Cordova
Accumulo Summit, June 2014
Scale, Security, Schema
Scale
to scale¹ - (vt) to change the size
of something
“let’s scale the cluster up to
twice the original size”
to scale² - (vi) to function
properly at a large scale
“Accumulo scales”
What is Large Scale?
Notebook Computer
• 16 GB DRAM
• 512 GB Flash Storage
• 2.3 GHz quad-core i7 CPU
Modern Server
• 100s of GB DRAM
• 10s of TB on disk
• 10s of cores
Large Scale
Chart: data volumes from 10 GB to 100 PB, in RAM and on disk, across a laptop, a single server, and clusters of 10, 100, 1,000, and 10,000 nodes
Data Composition
Chart: data composition growing month over month (January through April): original raw data, derivatives, query-focused datasets (QFDs), and indexes
Accumulo Scales
• From GB to PB, Accumulo keeps two things low:
• Administrative effort
• Scan latency
Scan Latency
Chart: scan latency in seconds (0 to 0.05) versus cluster size (0 to 1,000 nodes); latency stays low as the cluster grows
Administrative Overhead
Chart: failed machines versus administrative interventions (0 to 12) as the cluster grows from 0 to 1,000 nodes
Accumulo Scales
• From GB to PB, three things grow linearly:
• Total storage size
• Ingest Rate
• Concurrent scans
Ingest Benchmark
Chart: ingest rate in millions of entries per second (0 to 100) versus cluster size (0 to 1,000 machines), from the AWB benchmark
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf
1000 machines
100 M entries written per second
408 terabytes
7.56 trillion total entries
Graph Benchmark
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
1200 machines
4.4 trillion vertices
70.4 trillion edges
149 M edges traversed per
second
1 petabyte
Graph Analysis
Chart: graph sizes in billions of edges: Twitter 1.5, Yahoo! 6.6, Facebook 1,000, Accumulo 70,000
Accumulo is designed after
Google’s BigTable
BigTable powers hundreds of
applications at Google
BigTable serves 2+ exabytes
http://hbasecon.com/sessions/#session33
600 M queries per second
organization wide
From 10 to 10,000
Starting with ten machines
10^1
One rack
1 TB RAM
10-100 TB Disk
Hardware failures rare
Test Application Designs
Designing Applications for Scale
Keys to Scaling
1. Live writes go to all servers
2. User requests are satisfied by few scans
3. Turn updates into inserts
Keys to Scaling
Writes on all servers Few Scans
Hash / UUID Keys
RowID Col Value
af362de4 Bob
b23dc4be Annie
b98de2ff Joe
c48e2ade $30
c7e43fb2 $25
d938ff3d 32
e2e4dac4 59
e98f2eab3 43
Key Value
userA:name Bob
userA:age 43
userA:account $30
userB:name Annie
userB:age 32
userB:account $25
userC:name Joe
userC:age 59
Uniform writes
Monitor
Participating Tablet Servers
MyTable
Servers Hosted Tablets … Ingest
r1n1 1500 200k
r1n2 1501 210k
r2n1 1499 190k
r2n2 1500 200k
Hash / UUID Keys
RowID Col Value
af362de4 Bob
b23dc4be Annie
b98de2ff Joe
c48e2ade $30
c7e43fb2 $25
d938ff3d 32
e2e4dac4 59
e98f2eab3 43
3 x 1-entry scans on 3 servers
get(userA)
Keys to Scaling
Writes on all servers Few Scans
Hash / UUID Keys
Group for Locality
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
userA:name Bob
userA:age 43
userA:account $30
userB:name Annie
userB:age 32
userB:account $25
userC:name Joe
userC:age 59
RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
Still fairly uniform writes
Group for Locality
RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
1 x 3-entry scan on 1 server
get(userA)
Keys to Scaling
Writes on all servers Few Scans
Grouped Keys
Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
RowID Col Value
20140101 44
20140102 22
20140103 23
Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
Writes always go to one server
No write parallelism
Temporal Keys
RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
Fetching ranges uses few scans
get(20140101 to 201404)
Keys to Scaling
Writes on all servers Few Scans
Temporal Keys
Binned Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
RowID Col Value
0_20140101 44
1_20140102 22
2_20140103 23
Uniform Writes
Binned Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
RowID Col Value
0_20140101 44
0_20140104 25
1_20140102 22
1_20140105 31
2_20140103 23
2_20140106 27
Uniform Writes
Binned Temporal Keys
Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
RowID Col Value
0_20140101 44
0_20140104 25
0_20140107 25
1_20140102 22
1_20140105 31
1_20140108 17
2_20140103 23
2_20140106 27
Uniform Writes
Binned Temporal Keys
RowID Col Value
0_20140101 44
0_20140104 25
0_20140107 25
1_20140102 22
1_20140105 31
1_20140108 17
2_20140103 23
2_20140106 27
One scan per bin
get(20140101 to 201404)
Keys to Scaling
Writes on all servers Few Scans
Binned Temporal Keys
Keys to Scaling
• Key design is critical
• Group data under common row IDs to reduce
scans
• Prepend bins to row IDs to increase write
parallelism (see the sketch below)
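A minimal sketch of the binned temporal key design in Java, using the standard Accumulo client API. The table name, the "count" column family, the three-bin count, and hash-based bin assignment are all assumptions for illustration; the tables above assign bins round-robin instead.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;

public class BinnedTemporalWriter {
  static final int NUM_BINS = 3;  // assumption: three bins, as in the tables above

  // Prefix the date with a bin so consecutive dates land on different tablets/servers
  static String binnedRow(String yyyymmdd) {
    int bin = Math.abs(yyyymmdd.hashCode()) % NUM_BINS;
    return bin + "_" + yyyymmdd;  // e.g. "1_20140102"
  }

  static void writeDailyCount(Connector conn, String date, long count) throws Exception {
    BatchWriter writer = conn.createBatchWriter("events", new BatchWriterConfig());
    Mutation m = new Mutation(binnedRow(date));
    m.put("count", "", Long.toString(count));  // hypothetical column family "count"
    writer.addMutation(m);
    writer.close();
  }
}

At read time the client issues one scan per bin over the same date range, matching the "one scan per bin" pattern above.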
Splits
• Pre-split or organic splits
• Going from dev to production, you can ingest a
representative sample, obtain its split points, and use
them to pre-split the larger system (see the sketch below)
• Hundreds or thousands of tablets per server are OK
• Want at least one tablet per server
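A sketch of that dev-to-production step using the TableOperations API; the table names and the 1,000-split cap are hypothetical.

import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class PreSplit {
  static void preSplitFromDev(Connector conn) throws Exception {
    // Split points learned from a representative sample ingested into the dev table
    Collection<Text> devSplits = conn.tableOperations().listSplits("mytable_dev", 1000);

    // Apply the same split points to the (empty) production table before loading
    SortedSet<Text> splits = new TreeSet<Text>(devSplits);
    conn.tableOperations().addSplits("mytable", splits);
  }
}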
Effect of Compression
• Similar sorted keys compress well
• May need more data than you think to auto-split
Inserts are fast
10s of thousands per second per
machine
Updates *can* be …
Update Types
• Overwrite
• Combine
• Complex
Update - Overwrite
• Performance same as insert
• Ignore (don’t read) existing value
• Accumulo’s Versioning Iterator does the overwrite
Update - Overwrite
RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:age -> 34
Update - Overwrite
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:age -> 34
Update - Combine
• Things like X = X + 1
• Normally one would have to read the old value to
do this, but Accumulo iterators allow multiple
inserts to be combined at scan time or at
compaction time (see the sketch below)
• Performance is the same as inserts
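A sketch of the X = X + 1 pattern using the SummingCombiner iterator. The table name, the "account" column, the iterator priority, and the use of plain numeric string values (rather than the "$25"-style values shown in the tables) are assumptions for illustration.

import java.util.Collections;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class CombineUpdates {
  static void setup(Connector conn) throws Exception {
    IteratorSetting is = new IteratorSetting(10, "sumAccount", SummingCombiner.class);
    SummingCombiner.setEncodingType(is, LongCombiner.Type.STRING);  // values stored as decimal strings
    SummingCombiner.setColumns(is,
        Collections.singletonList(new IteratorSetting.Column("account")));
    conn.tableOperations().attachIterator("users", is);  // applied at scan and compaction time
  }

  static void addToBalance(Connector conn, String row, long delta) throws Exception {
    BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());
    Mutation m = new Mutation(row);
    m.put("account", "", Long.toString(delta));  // insert only the delta; the combiner sums on read
    bw.addMutation(m);
    bw.close();
  }
}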
Update - Combine
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:account -> +10
Update - Combine
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
af362de4 account $10
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:account -> +10
Update - Combine
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
af362de4 account $10
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
getAccount(userB)
$35
Update - Combine
After compaction
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
Update - Complex
• Some updates require looking at more data than
iterators have access to - such as data spanning multiple rows
• These require reading the existing data out in order to
write the new value (see the sketch below)
• Performance will be much slower
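A sketch of the read-then-write pattern behind the account example that follows; the table name and column layout are assumptions matching the illustration.

import java.util.Map.Entry;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class ComplexUpdate {
  // Read one user's balance (one scan), assuming plain numeric string values
  static long getBalance(Connector conn, String row) throws Exception {
    Scanner s = conn.createScanner("users", Authorizations.EMPTY);
    s.setRange(Range.exact(row));
    s.fetchColumn(new Text("account"), new Text(""));
    long balance = 0;
    for (Entry<Key, Value> e : s) {
      balance = Long.parseLong(e.getValue().toString());
    }
    return balance;
  }

  // userC:account = getBalance(userA) + getBalance(userB) : two reads plus one write
  static void setCombined(Connector conn, String userA, String userB, String userC) throws Exception {
    long total = getBalance(conn, userA) + getBalance(conn, userB);
    BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());
    Mutation m = new Mutation(userC);
    m.put("account", "", Long.toString(total));
    bw.addMutation(m);
    bw.close();
  }
}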
Update - Complex
userC:account =
getBalance(userA) +
getBalance(userB)
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
c48e2ade account $40
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
35+30 = 65
Update - Complex
userC:account =
getBalance(userA) +
getBalance(userB)
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
c48e2ade account $65
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
35+30 = 65
Planning a Larger-Scale Cluster
10^2 - 10^4
Storage vs Ingest
Chart: total storage in terabytes and ingest rate in millions of entries per second versus cluster size (10 to 10,000 machines), for servers with 1 x 1 TB and 12 x 3 TB disks
Model for Ingest Rates
A = 0.85^(log2 N) * N * S
N - Number of machines
S - Single Server throughput (entries / second)
A - Aggregate Cluster throughput (entries / second)
Expect about 85% efficiency when doubling the cluster:
each doubling yields roughly 1.7x the aggregate write rate
Estimating Machines Required
N = 2^(log2(A/S) / 0.7655347), where 0.7655347 = log2(1.7)
N - Number of machines
S - Single Server throughput (entries / second)
A - Target Aggregate throughput (entries / second)
Expect about 85% efficiency when doubling the cluster:
each doubling yields roughly 1.7x the aggregate write rate
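A small numeric sketch of this model; the per-server rate and the target aggregate rate are hypothetical values chosen only to show the arithmetic.

public class IngestModel {
  // A = 0.85^(log2 N) * N * S : aggregate rate for N servers writing S entries/second each
  static double aggregateRate(int n, double s) {
    double log2n = Math.log(n) / Math.log(2);
    return Math.pow(0.85, log2n) * n * s;
  }

  // N = 2^(log2(A/S) / 0.7655347), where 0.7655347 = log2(1.7)
  static double machinesNeeded(double a, double s) {
    double log2Ratio = Math.log(a / s) / Math.log(2);
    return Math.pow(2, log2Ratio / 0.7655347);
  }

  public static void main(String[] args) {
    double s = 400000;  // hypothetical single-server rate (entries/second)
    System.out.println(aggregateRate(1000, s));        // aggregate rate on 1,000 machines
    System.out.println(machinesNeeded(100000000, s));  // machines needed for a 100M entries/second target
  }
}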
Predicted Cluster Sizes
Chart: predicted number of machines (0 to 12,000) needed to reach a target aggregate ingest rate (0 to 600 million entries per second)
100 Machines
10^2
Multiple racks
10 TB RAM
100 TB - 1PB Disk
Some hardware failures
in the first week
(burn in)
Expect 3 failed HDs in first 3 mo
Another 4 within the first year
http://static.googleusercontent.com/media/
research.google.com/en/us/archive/disk_failures.pdf
Can process the
1000 Genomes data set
260 TB
www.1000genomes.org
Can store and index the
Common Crawl Corpus
2.8 Billion web pages
541 TB
commoncrawl.org
One year of Twitter
182 billion tweets
483 TB
http://www.sec.gov/Archives/edgar/data/
1418091/000119312513390321/d564001ds1.htm
Deploying an Application
Diagram: Users → Clients → Tablet Servers
May not see the effect of writing
to disk for a while
1000 machines
10^3
Multiple rows of racks
100 TB RAM
1-10 PB Disk
Hardware failure is a regular
occurrence
Hard drive failure about every 5 days (average).
Failures will be skewed towards the beginning of
the year
Can traverse the ‘brain graph’
70 trillion edges, 1 PB
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Facebook Graph
1s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/
xldb2012_wed_1105_DhrubaBorthakur.pdf
Netflix Video Master Copies
3.14 PB
http://www.businessweek.com/articles/2013-05-09/netflix-reed-
hastings-survive-missteps-to-join-silicon-valleys-elite
World of Warcraft Backend Storage
1.3 PB
http://www.datacenterknowledge.com/archives/2009/11/25/
wows-back-end-10-data-centers-75000-cores/
Webpages, live on the Internet
14.3 Trillion
http://www.factshunt.com/2014/01/
total-number-of-websites-size-of.html
Things like the difference between
two compression algorithms start
to make a big difference
Use range compactions to effect
changes on portions of a table
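For example, after changing a compression codec or iterator setting, a range compaction rewrites only the affected tablets. The table name and row bounds below are hypothetical.

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class RangeCompact {
  static void compactJanuary(Connector conn) throws Exception {
    // Rewrite only tablets covering rows 20140101 through 20140131,
    // flushing in-memory data first and waiting for the compaction to finish
    conn.tableOperations().compact("events", new Text("20140101"), new Text("20140131"), true, true);
  }
}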
Lay off ZooKeeper
Watch the garbage collector and
NameNode ops
Garbage Collection > 5 minutes?
Start thinking about
NameNode Federation
Accumulo 1.6
Multiple NameNodes
Diagram: Accumulo running over two separate HDFS clusters, each with its own NameNode and DataNodes
Multiple NameNodes
Diagram: Accumulo over multiple NameNodes that share the same DataNodes (HDFS Federation; requires Hadoop 2.0)
More NameNodes = higher risk of
one going down.
Can use HA NameNodes in
conjunction with Federation
10,000 machines
10^4
You, my friend, are here to
kick a** and chew bubble gum
1 PB RAM
10-100 PB Disk
1 hardware failure every hour on
average
Entire Internet Archive
15 PB
http://www.motherjones.com/media/2014/05/
internet-archive-wayback-machine-brewster-kahle
A year’s worth of data from the
Large Hadron Collider
15 PB
http://home.web.cern.ch/about/computing
0.1% of all Internet traffic in 2013
43.6 PB
http://www.factshunt.com/2014/01/
total-number-of-websites-size-of.html
Facebook Messaging Data
10s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/
xldb2012_wed_1105_DhrubaBorthakur.pdf
Facebook Photos
240 billion
High 10s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/
xldb2012_wed_1105_DhrubaBorthakur.pdf
Must use multiple NameNodes
Can tune back heartbeat frequency and
the periodicity of central processes in
general
Can combine multiple PB data
sets
Up to 10 quadrillion entries in a
single table
While maintaining sub-second
lookup times
Only with Accumulo 1.6
Dealing with data over time
Data Over Time - Patterns
• Initial Load
• Increasing Velocity
• Focus on Recency
• Historical Summaries
Initial Load
• Get a pile of old data into Accumulo fast
• Latency not important (data is old)
• Throughput critical
Bulk Load RFiles
Bulk Loading
Diagram: MapReduce → RFiles → Accumulo
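A sketch of the final import step once a MapReduce job using AccumuloFileOutputFormat has written RFiles to HDFS; the paths and table name are hypothetical.

import org.apache.accumulo.core.client.Connector;

public class BulkLoad {
  static void load(Connector conn) throws Exception {
    // /data/bulk/files    : RFiles produced by the MapReduce job
    // /data/bulk/failures : an existing, empty directory for files that fail to import
    // setTime = false     : keep the timestamps written into the RFiles
    conn.tableOperations().importDirectory("mytable", "/data/bulk/files", "/data/bulk/failures", false);
  }
}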
Increasing velocity
If your data isn’t big today,
wait a little while
Accumulo scales up dynamically,
online. No downtime
This is the first sense of "scale": changing size
Scaling Up
Clients
Accumulo
HDFS
3 physical servers
Each running
a Tablet Server process
and a Data Node process
Scaling Up
Clients
Accumulo
HDFS
Start 3 new Tablet Server procs
3 new Data node processes
Scaling Up
Clients
Accumulo
HDFS
master immediately assigns tablets
Scaling Up
Clients
Accumulo
HDFS
Clients immediately
begin querying new
Tablet Servers
Scaling Up
Clients
Accumulo
HDFS
new Tablet Servers read data from old Data nodes
Scaling Up
Clients
Accumulo
HDFS
new Tablet Servers write data to new Data Nodes
Never really seen
anyone do this
Except myself
20 machines in Amazon EC2
to 400 machines
all during the same MapReduce job
reading data out of Accumulo,
summarizing, and writing back
Scaled back down to 20
machines when done
Just killed Tablet Servers
Decommissioned Data Nodes for
safe data consolidation to
remaining 20 nodes
Other ways to go from
10^x to 10^(x+1)
Accumulo Table Export
followed by HDFS DistCp to the new
cluster
Maybe the new replication feature
Newer Data is Read more Often
Accumulo keeps newly written
data in memory
Block Cache can keep recently
queried data in memory
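The data block cache is off by default and can be enabled per table; a sketch with a hypothetical table name:

import org.apache.accumulo.core.client.Connector;

public class EnableBlockCache {
  static void enable(Connector conn) throws Exception {
    // Cache recently read data blocks and index blocks in tablet server memory
    conn.tableOperations().setProperty("users", "table.cache.block.enable", "true");
    conn.tableOperations().setProperty("users", "table.cache.index.enable", "true");
  }
}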
Combining iterators make it easy to
maintain summaries of large
amounts of raw events
This reduces the storage burden
Historical Summaries
Chart: raw events processed versus unique entities stored (0 to 8,000), April through July
Age-off iterator can automatically
remove data over a certain age
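A sketch of attaching the age-off iterator so older entries are dropped at scan and compaction time; the table name, priority, and 90-day TTL are hypothetical.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.user.AgeOffFilter;

public class AgeOff {
  static void configure(Connector conn) throws Exception {
    IteratorSetting is = new IteratorSetting(30, "ageoff", AgeOffFilter.class);
    is.addOption("ttl", Long.toString(90L * 24 * 60 * 60 * 1000));  // 90 days, in milliseconds
    conn.tableOperations().attachIterator("events", is);
  }
}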
IBM estimates 2.5 exabytes of
data is created every day
http://www-01.ibm.com/software/data/bigdata/
what-is-big-data.html
90% of available data created in
last 2 years
http://www-01.ibm.com/software/data/bigdata/
what-is-big-data.html
25 new 10k node Accumulo
clusters per day
Accumulo is doing its part to get
in front of the big data trend
Questions?
@aaroncordova