Compaction and Splitting in Apache Accumulo

Billie Rinaldi
billie@hortonworks.com
October 24, 2012
© Hortonworks Inc. 2012

What are compaction and splitting?

•Accumulo tables are divided into
non-overlapping key ranges called
tablets
•Compaction selects a set of sorted
files for a single tablet and rewrites
them into one file
•Splitting divides a tablet into two
tablets
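As a rough illustration of the definitions above, a table can be modeled as a sorted list of split points. This is a minimal sketch, not Accumulo's implementation: the `Table` class and its methods are hypothetical names, and it assumes each tablet covers the rows after the previous split point up to and including its own.

```python
from bisect import bisect_left, insort


class Table:
    """Toy table (hypothetical): tablets are defined by sorted split points."""

    def __init__(self):
        # No split points means a single tablet covers every row.
        self.split_points = []

    def tablet_index(self, row):
        # Assumption: tablet i holds rows in (split_points[i-1], split_points[i]];
        # the last tablet holds everything after the final split point.
        return bisect_left(self.split_points, row)

    def split(self, row):
        # Adding one split point divides exactly one tablet into two.
        if row not in self.split_points:
            insort(self.split_points, row)


table = Table()
table.split("m")
print(table.tablet_index("apple"), table.tablet_index("m"), table.tablet_index("zebra"))
```

Because the ranges do not overlap, every row maps to exactly one tablet, both before and after a split.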


Tablet Overview

•When memory fills, new sorted files
are created by flushing
•Sorted files are combined together
into fewer sorted files
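The flush-then-combine cycle can be sketched in a few lines. This is a simplified model under stated assumptions: `ToyTablet`, `memory_limit`, and the in-memory lists standing in for sorted files are all hypothetical, and the newest value is assumed to win when a key appears in several files.

```python
import heapq


class ToyTablet:
    """Hypothetical tablet: an in-memory map plus a list of sorted 'files'."""

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.memory = {}
        self.files = []  # oldest first; each file is a sorted list of (key, value)

    def write(self, key, value):
        self.memory[key] = value
        if len(self.memory) >= self.memory_limit:
            self.flush()

    def flush(self):
        # When memory fills, its contents become a new sorted file.
        if self.memory:
            self.files.append(sorted(self.memory.items()))
            self.memory = {}

    def compact(self):
        # Merge all sorted files into one; the newest file wins on duplicate keys.
        if len(self.files) < 2:
            return
        tagged = [[(k, -age, v) for k, v in f] for age, f in enumerate(self.files)]
        merged, last_key = [], object()
        for key, _, value in heapq.merge(*tagged):
            if key != last_key:
                merged.append((key, value))
                last_key = key
        self.files = [merged]


tablet = ToyTablet(memory_limit=2)
for key, value in [("b", 1), ("a", 2), ("a", 3), ("c", 4)]:
    tablet.write(key, value)
files_before = len(tablet.files)
tablet.compact()
print(files_before, tablet.files)
```

Since every input file is already sorted, the merge is a single streaming pass, which is why rewriting sorted files into fewer sorted files stays cheap per byte.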

How much data are you writing?

•If you never compact: O(N)
•If you always compact: O(N²)
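The two extremes can be worked out directly. The sketch below assumes N flushes of one unit each, and that "always compact" means rewriting every file into one after each flush from the second onward; these assumptions are not stated on the slide.

```latex
\begin{align*}
W_{\text{never}}(N)  &= N = O(N) \\
W_{\text{always}}(N) &= \underbrace{N}_{\text{flushes}}
                      + \underbrace{\sum_{i=2}^{N} i}_{\text{compactions}}
                      = N + \frac{N(N+1)}{2} - 1 = O(N^2)
\end{align*}
```

Never compacting minimizes writes but leaves N files to consult on every read; always compacting leaves one file at quadratic write cost.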

Accumulo Compaction Algorithm

•Compact a set of files when:

size of the largest file × compaction ratio ≤ sum of the sizes of the files

table.compaction.major.ratio
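The condition above can be sketched as a selection function. This is a hedged illustration, not Accumulo's code: `select_files_to_compact` is a hypothetical name, and the step of dropping the largest file and retrying when the full set does not qualify is an assumption added here.

```python
def select_files_to_compact(sizes, ratio):
    """Return the file sizes to compact together, or [] if no set qualifies.

    Condition from the slide: largest * ratio <= sum of the sizes.
    Assumption: if the whole set fails, drop the largest file and retry.
    """
    candidates = sorted(sizes, reverse=True)
    while len(candidates) > 1:
        if candidates[0] * ratio <= sum(candidates):
            return candidates
        candidates = candidates[1:]  # drop the largest file and test the rest
    return []


print(select_files_to_compact([1, 1, 1], 3))
print(select_files_to_compact([3, 1, 1], 3))
print(select_files_to_compact([3, 1, 1, 1], 3))
```

A larger ratio makes the condition harder to satisfy, so compactions happen less often and involve more files.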

In Action (r = 3, N = 1, W = 1)

In Action (r = 3, N = 2, W = 2)

In Action (r = 3, N = 3, W = 3)

In Action (r = 3, N = 3, W = 6)

In Action (r = 3, N = 4, W = 7)

In Action (r = 3, N = 5, W = 8)

In Action (r = 3, N = 6, W = 9)

In Action (r = 3, N = 6, W = 12)

In Action (r = 3, N = 7, W = 13)

In Action (r = 3, N = 8, W = 14)

In Action (r = 3, N = 9, W = 15)

In Action (r = 3, N = 9, W = 24)

In Action (r = 3, N = 27, W = 90*)
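The sequence of (N, W) values in the slide titles above can be reproduced with a small simulation. It is a sketch under stated assumptions: N is read as the number of one-unit flushes and W as the total units written by flushes and compactions combined, the function names are hypothetical, and the rule of dropping the largest file and retrying is assumed rather than taken from the slides.

```python
def pick(sizes, ratio):
    # Largest * ratio <= sum; otherwise drop the largest and retry (assumption).
    candidates = sorted(sizes, reverse=True)
    while len(candidates) > 1:
        if candidates[0] * ratio <= sum(candidates):
            return candidates
        candidates = candidates[1:]
    return []


def simulate(num_flushes, ratio=3):
    """Return W after each flush, once any triggered compactions have run."""
    files, written, history = [], 0, []
    for _ in range(num_flushes):
        files.append(1)  # each flush writes one unit
        written += 1
        while True:
            chosen = pick(files, ratio)
            if not chosen:
                break
            for size in chosen:
                files.remove(size)
            merged = sum(chosen)
            files.append(merged)  # the compaction rewrites the chosen files
            written += merged
        history.append(written)
    return history


w = simulate(27)
print(w[2], w[5], w[8], w[26])  # W after N = 3, 6, 9, 27 -> 6 12 24 90
```

Under these assumptions the totals after compaction match the slide titles: W = 6 at N = 3, W = 12 at N = 6, W = 24 at N = 9, and W = 90 at N = 27.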

Amount of data written

•W(r^k) = (k+1)r^k – (k–1)r^(k–1)
•Thus, with N = r^k, W(N) = O(N log N)
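The step from the closed form to the asymptotic bound can be written out by substituting N = r^k, so that k = log_r N and r^(k-1) = N/r. This is a worked restatement of the formula above, not an additional result.

```latex
\begin{align*}
W(N) &= (\log_r N + 1)\,N - (\log_r N - 1)\,\frac{N}{r} \\
     &= N\left[\left(1 - \frac{1}{r}\right)\log_r N + 1 + \frac{1}{r}\right] \\
     &= O(N \log N)
\end{align*}
```

As a check, r = 3 and N = 27 give 27 × (3 × 2/3 + 4/3) = 90, matching the last "In Action" slide.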

HBase Compaction Algorithm


size of the largest file ≤ sum of the sizes of smaller files × compaction ratio

hbase.hstore.compaction.ratio

HBase Compaction Algorithm


size of the largest file ≤ sum of the sizes of smaller files × compaction ratio

HBase ratio = 1 / (Accumulo ratio – 1)
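The relationship between the two ratios follows from rearranging the Accumulo condition. In the sketch below, L is the size of the largest file, S is the sum of the sizes of the smaller files, and r_A and r_H are the Accumulo and HBase ratios; these symbols are introduced here for illustration, and r_A > 1 is assumed.

```latex
\begin{align*}
L \cdot r_A &\le L + S            && \text{(Accumulo condition)} \\
L\,(r_A - 1) &\le S \\
L &\le S \cdot \frac{1}{r_A - 1}  && \text{(HBase form, } r_H = \tfrac{1}{r_A - 1}\text{)}
\end{align*}
```

For example, an Accumulo ratio of 3 corresponds to an HBase ratio of 0.5.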
Other Compaction-related Properties

•Accumulo
table.file.max
tserver.compaction.major.thread.files.open.max
tserver.compaction.major.delay
table.compaction.major.everything.idle

•HBase
hbase.hstore.compactionThreshold
hbase.hstore.blockingStoreFiles
hbase.hstore.blockingWaitTime
hbase.hstore.compaction.min
hbase.hstore.compaction.max
hbase.hstore.compaction.min.size
hbase.hstore.compaction.max.size
Accumulo Splitting

•Always check to see if a split is
needed before compacting
•If it is needed, split first
•File names stored in metadata table

[Figure label: split threshold]
Accumulo Splitting Process

•Tablet closed, no new writes
•Three writes to the metadata table
–tablet made smaller & marked as splitting
–new tablet added
–original tablet's splitting marks removed
•Tablet server swaps new tablets for
old tablet in its online tablet list
•Master informed
Accumulo Splitting Recovery

•Whenever a tablet is brought online,
the tablet server checks to see if it
has split marks.
•If so, it assumes the splitting
process was interrupted and
finishes making changes to the
metadata table.
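The three metadata writes and the recovery check can be sketched together. This is a toy model, not Accumulo's metadata schema: the dictionary layout, the `splitting` field, the function names, and the `crash_after` hook are all hypothetical, and file handling and tablet closing are omitted.

```python
class CrashError(Exception):
    """Simulated tablet server failure (hypothetical)."""


def split_tablet(metadata, end_row, split_row, crash_after=None):
    old_prev = metadata[end_row]["prev_end_row"]
    # Write 1: make the original tablet smaller and mark it as splitting.
    metadata[end_row] = {
        "prev_end_row": split_row,
        "splitting": {"old_prev_end_row": old_prev},
    }
    if crash_after == 1:
        raise CrashError
    finish_split(metadata, end_row, crash_after)


def finish_split(metadata, end_row, crash_after=None):
    entry = metadata[end_row]
    marks = entry["splitting"]
    # Write 2: add the new tablet; repeating this write is harmless.
    metadata[entry["prev_end_row"]] = {
        "prev_end_row": marks["old_prev_end_row"],
        "splitting": None,
    }
    if crash_after == 2:
        raise CrashError
    # Write 3: remove the original tablet's splitting marks.
    entry["splitting"] = None


def bring_online(metadata, end_row):
    # Recovery: split marks mean an earlier split was interrupted, so finish it.
    if metadata[end_row]["splitting"] is not None:
        finish_split(metadata, end_row)
    return metadata[end_row]


metadata = {"z": {"prev_end_row": None, "splitting": None}}
try:
    split_tablet(metadata, "z", "m", crash_after=1)
except CrashError:
    pass
interrupted = metadata["z"]["splitting"] is not None
bring_online(metadata, "z")
print(interrupted, sorted(metadata))
```

Because the first write records enough information to redo the later ones, a failure after any single write leaves the split recoverable the next time the tablet is loaded.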

Questions?
dev@accumulo.apache.org
