Securely explore your data 
PERFORMANCE MODELS 
FOR APACHE ACCUMULO: 
THE HEAVY TAIL OF A SHARED-NOTHING 
ARCHITECTURE 
Chris McCubbin 
Director of Data Science 
Sqrrl Data, Inc.
I’M NOT ADAM FUCHS 
• But perhaps I’m still an interesting guy 
• MS in CS from UMBC in Network Security and 
Quantum Computing 
• 8 years at JHU/APL working on UxV Swarms 
• 4 years at JHU/APL and TexelTek creating Big 
Data Applications for the NSA 
• Co-founder and Director of Data Science at Sqrrl 
©2014 Sqrrl Data, Inc 2
SO, YOUR DISTRIBUTED 
APPLICATION IS SLOW 
• Today’s distributed applications are built from tens or 
hundreds of library components 
• Each has many versions, so internet advice can be ineffective or, 
worse, flat-out wrong 
• Hundreds of settings 
• Some, shall we say, could be better documented 
• Shared-nothing architectures are usually “shared-little” 
architectures with tricky interactions 
• Profiling is hard and time-consuming 
• What do we do? 
©2014 Sqrrl Data, Inc 3
TODAY’S TALK 
1. Quick intro to performance optimization 
2. Tricks and techniques for targeted, model-driven 
performance improvement of distributed applications 
3. A deep dive into improving bulk load application 
performance 
©2014 Sqrrl Data, Inc 4
The Apache Accumulo™ sorted, distributed key/value store is a secure, robust, 
scalable, high performance data storage and retrieval system. 
• Many applications in real-time storage and analysis of “big data”: 
• Spatio-temporal indexing in non-relational distributed databases - Fox et al 
2013 IEEE International Congress on Big Data 
• Big Data Dimensional Analysis - Gadepally et al IEEE HPEC 2014 
• Leading its peers in performance and scalability: 
• Achieving 100,000,000 database inserts per second using Accumulo and 
D4M - Kepner et al IEEE HPEC 2014 
• An NSA Big Graph experiment (Technical Report NSA-RD-2013-056002v1) 
• Benchmarking Apache Accumulo BigData Distributed Table Store Using Its 
Continuous Test Suite - Sen et al 2013 IEEE International Congress on Big 
Data 
For more papers and presentations, see http://accumulo.apache.org/papers.html 
©2014 Sqrrl Data, Inc 5
SCALING UP: DIVIDE & CONQUER 
• Collections of KV pairs form Tables 
• Tables are partitioned into Tablets 
• Metadata tablets hold info about 
other tablets, forming a 3-level 
hierarchy 
• A Tablet is a unit of work for a 
Tablet Server 
[Diagram: tables are split into tablets by key range. Adam’s Table has data tablets (-∞ : thing) and (thing : ∞); Encyclopedia has data tablets (-∞ : Ocelot), (Ocelot : Yak), and (Yak : ∞); Foo has a single data tablet (-∞ to ∞). A well-known location in ZooKeeper points to the root tablet (-∞ to ∞), which points to metadata tablet 1 (-∞ to “Encyclopedia:Ocelot”) and metadata tablet 2 (“Encyclopedia:Ocelot” to ∞), which in turn point to the data tablets.]
©2014 Sqrrl Data, Inc 6
PERFORMANCE ANALYSIS CYCLE 
Simulate & 
Experiment 
Modify 
Code 
Analyze 
Start: 
Create 
Model 
Refine 
Model 
Outputs: 
Better Code 
+ Models 
©2014 Sqrrl Data, Inc 7
MAKING A MODEL 
• Identify points where low-impact metrics are available 
• Add some if needed 
• Create parallel state machine models with 
components driven by these metrics 
• Estimate running times and bottlenecks from 
a priori information and/or apply measured 
statistics 
• Focus testing on validation of the initial 
model and the (estimated) pain points 
• Apply Amdahl’s Law (worked example after this slide) 
• Rinse, repeat 
©2014 Sqrrl Data, Inc 8
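As a worked example of the Amdahl’s Law step (illustrative numbers, not from the talk): if a component accounts for a fraction p of total runtime and is made s times faster, overall speedup is bounded by 1 / ((1 - p) + p / s). A sort phase that is 40% of the job and is sped up 4x yields at most 1 / (0.6 + 0.1) ≈ 1.43x overall, which tells you quickly whether that optimization is worth pursuing.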
BULK INGEST OVERVIEW 
• Accumulo supports two mechanisms to bring 
data in: streaming ingest and bulk ingest. 
• Bulk Ingest 
• Goal: maximize throughput; latency is not a 
constraint. 
• Create a set of Accumulo RFiles, then register those 
files with Accumulo (a minimal driver sketch follows this slide). 
• RFiles are groups of sorted key-value pairs with 
some indexing information 
• MapReduce has a built-in key sorting phase: a good 
fit to produce RFiles 
©2014 Sqrrl Data, Inc 9
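To make that flow concrete, here is a minimal driver sketch in Java against Accumulo 1.x-era APIs. The table name, paths, credentials, and the mapper/reducer are placeholders, and this is an illustration of the flow described above rather than the code used in the talk: the job writes sorted Key/Value pairs into RFiles via AccumuloFileOutputFormat, and the resulting directory is then registered with importDirectory.

    // Minimal bulk-ingest driver sketch (hypothetical names, no error handling).
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class BulkIngestDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "bulk-ingest");
        job.setJarByClass(BulkIngestDriver.class);

        // Plug in application Mapper/Reducer classes (omitted here) that emit
        // Accumulo Key/Value pairs; each reducer's output must be sorted by Key.
        job.setOutputKeyClass(Key.class);
        job.setOutputValueClass(Value.class);

        // Reducers write RFiles directly instead of sending live mutations.
        job.setOutputFormatClass(AccumuloFileOutputFormat.class);
        AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk/files"));

        if (!job.waitForCompletion(true))
          System.exit(1);

        // The "Register" phase of the model: hand the finished RFiles to Accumulo.
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
            .getConnector("user", new PasswordToken("secret"));
        conn.tableOperations().importDirectory("mytable", "/tmp/bulk/files",
            "/tmp/bulk/failures", false);
      }
    }

In practice a partitioner aligned with the table’s split points keeps each reducer’s output inside a single tablet’s range, which keeps the register step cheap.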
BULK INGEST MODEL 
[Timeline: Map, Reduce, and Register phases laid out over time]
©2014 Sqrrl Data, Inc 10
BULK INGEST MODEL 
Hypothetical resource usage over time: 
• Map: 100% CPU, 20% Disk, 0% Network, 46 seconds 
• Reduce: 40% CPU, 100% Disk, 20% Network, 168 seconds 
• Register: 10% CPU, 20% Disk, 40% Network, 17 seconds 
©2014 Sqrrl Data, Inc 11
INSIGHT 
• Spare disk here, spare CPU there – can we even out resource consumption? 
• Why did reduce take 168 seconds? It should be more like 40 seconds. 
• No clear bottleneck during registration – is there a synchronization or 
serialization problem? 
(Same hypothetical resource usage as the previous slide: Map 100% CPU / 46 s, Reduce 100% Disk / 168 s, Register 17 s.)
©2014 Sqrrl Data, Inc 12
LOOKING DEEPER: 
REFINED BULK INGEST MODEL 
[Timeline diagram: the map thread runs Setup, Map, Sort, Spill, Merge, and Serve; the reduce thread runs Shuffle, Sort, Reduce, and Output; a parallel latch marks the synchronization point between them.]
©2014 Sqrrl Data, Inc 13
BULK INGEST MODEL PREDICTIONS 
• We can constrain parts of the model by physical 
throughput limitations 
• Disk -> memory (~100 MB/s avg 7200 rpm sequential read rate) 
• Input reader 
• Memory -> Disk (~100 MB/s) 
• Spill, OutputWriter 
• Disk -> Disk (~50 MB/s) 
• Merge 
• Network (Gigabit = 125 MB/s) 
• Shuffle 
• And/or algorithmic limitations 
• Sort, (Our) Map, (Our) Reduce, SerDe 
©2014 Sqrrl Data, Inc 14
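To illustrate how these throughput limits turn into time estimates (the data volume below is an assumption, not a measurement from the talk): a phase’s lower-bound duration is roughly the volume of data it must move divided by the bandwidth of its limiting resource.

    // Back-of-the-envelope phase-time bounds from the bandwidth limits above.
    public class PhaseEstimate {
      // Assumed bandwidths in MB/s, matching the slide.
      static final double DISK_WRITE = 100, DISK_TO_DISK = 50, NETWORK = 125;

      static double seconds(double dataMB, double mbPerSec) {
        return dataMB / mbPerSec;
      }

      public static void main(String[] args) {
        double intermediateMB = 7000; // hypothetical map-output volume
        System.out.printf("spill   >= %.0f s%n", seconds(intermediateMB, DISK_WRITE));
        System.out.printf("merge   >= %.0f s%n", seconds(intermediateMB, DISK_TO_DISK));
        System.out.printf("shuffle >= %.0f s%n", seconds(intermediateMB, NETWORK));
      }
    }

Bounds like these are what make a 168-second reduce look suspicious when the model says it should take roughly 40 seconds.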
PERFORMANCE GOAL MODEL 
Performance goals obtained through: 
• Simulation of individual components 
• Prediction of available resources at runtime 
©2014 Sqrrl Data, Inc 15
INSTRUMENTATION 
CONFIGURATION 
application version 1.3.3 
application sha 8d17baf8 
yarn.nodemanager.resource.memory-mb 43008 
yarn.scheduler.minimum-allocation-mb 2048 
yarn.scheduler.maximum-allocation-mb 43008 
yarn.app.mapreduce.am.resource.mb 2048 
yarn.app.mapreduce.am.command-opts -Xmx1536m 
mapreduce.map.memory.mb 2048 
mapreduce.map.java.opts -Xmx1638m 
mapreduce.reduce.memory.mb 2048 
mapreduce.reduce.java.opts -Xmx1638m 
mapreduce.task.io.sort.mb 100 
mapreduce.map.sort.spill.percent 0.8 
mapreduce.task.io.sort.factor 10 
mapreduce.reduce.shuffle.parallelcopies 5 
mapreduce.job.reduce.slowstart.completedmaps 1 
mapreduce.map.output.compress FALSE 
mapred.map.output.compression.codec n/a 
description baseline 
SYSTEM 
node num 1 
map num containers 20 
red num containers 20 
cores physical 12 
cores logical 24 
disk num 8 
disk bandwidth 100 
replication 1 
monitoring TRUE 
DATA 
input type arcsight 
input block size 32 
input block count 20 
input total 672054649 
output map 9313303723 
output map:combine input records 243419324 
output map:combine records out 209318830 
output map:spill 7325671992 
output final 573802787 
output map:combine 7301374577 
TIME 
map:setup avg 8 
map:map avg 12 
map:sort avg 12 
map:spill avg 12 
map:spill count 7 
map:merge avg 46 
map total 290 
red:shuffle avg 6 
red:merge avg 38 
red:reduce avg 68 
red:total avg 112 
red:reducer count 20 
job:total 396 
RATIOS 
input explosion factor 13.877904 
compression intermediate 1.003327786 
load combiner output 0.783972562 
total ratio 0.786581455 
CONSTANTS 
avg schema entry size (bytes) 59 
effective MB/sec 1.618488025 
©2014 Sqrrl Data, Inc 16
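As a sanity check (derived from the table, not stated on the slide): 672,054,649 input bytes is about 641 MB, and 641 MB over the 396-second job total gives roughly 1.62 MB/s, matching the "effective MB/sec" entry. That figure is far below any single disk’s bandwidth, which is what justifies digging into where the time actually goes.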
PERFORMANCE MEASUREMENT 
Baseline (naive implementation) 
[Measured baseline timeline: map thread (Setup, Map, Sort, Spill, Merge, Serve) and reduce thread (Shuffle, Sort, Reduce, Output)]
©2014 Sqrrl Data, Inc 17
PATH TO IMPROVEMENT 
1. Profiling revealed much time spent serializing/ 
deserializing Key 
2. With proper configuration, MapReduce supports 
comparison of keys in serialized form 
3. Rewriting Key’s serialization led to an order-preserving 
encoding that is easy to compare in serialized form 
4. Configure MapReduce to use native code to compare 
Keys 
5. Tweak map input size and spill memory for as few spills 
as possible 
©2014 Sqrrl Data, Inc 18
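A minimal sketch of steps 2-4 in Java (the OrderPreservingKey class below is illustrative, not Accumulo’s actual Key or the talk’s rewritten serialization): when the serialized form preserves sort order, a registered WritableComparator can compare raw bytes and the MapReduce sort never has to deserialize keys.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Illustrative only: a tiny key whose serialized form (big-endian long with the
    // sign bit flipped) sorts the same way as its in-memory comparison.
    public class OrderPreservingKey implements WritableComparable<OrderPreservingKey> {
      private long id;

      public void set(long id) { this.id = id; }

      @Override
      public void write(DataOutput out) throws IOException {
        // Flip the sign bit so unsigned byte order matches signed long order.
        out.writeLong(id ^ Long.MIN_VALUE);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        id = in.readLong() ^ Long.MIN_VALUE;
      }

      @Override
      public int compareTo(OrderPreservingKey o) { return Long.compare(id, o.id); }

      // Raw comparator: compares the serialized bytes directly, no deserialization.
      public static class Comparator extends WritableComparator {
        public Comparator() { super(OrderPreservingKey.class); }
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
          return compareBytes(b1, s1, l1, b2, s2, l2);
        }
      }

      static { // Register it so the MapReduce sort uses byte-level comparison.
        WritableComparator.define(OrderPreservingKey.class, new Comparator());
      }
    }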
PERFORMANCE MEASUREMENT 
Optimized sorting 
• Improvements: 
• Time for map-side merge went down 
• Sort performance drastically improved in both 
map and reduce phases 
• 300% faster 
©2014 Sqrrl Data, Inc 19
PERFORMANCE MEASUREMENT 
Optimized sorting 
[Measured timeline with optimized sorting: map thread (Setup, Map, Sort, Spill, Merge, Serve) and reduce thread (Shuffle, Sort, Reduce, Output)]
Insights: 
• Map is slower than expected 
• Output is disk-bound; maybe we can move more processing to Reduce 
• “Reverse Amdahl’s law” 
• Intermediate data inflation ratio (output/input for map) is very high 
©2014 Sqrrl Data, Inc 20
PATH TO IMPROVEMENT 
1. Profiling revealed much time spent copying data 
2. Evaluation of data passed from map to reduce 
revealed inefficiencies: 
• The (constant) timestamp still cost 8 bytes per key 
• Repeated column names could be encoded/ 
compressed 
• Some Key/Value pairs didn’t need to be created 
until reduce 
©2014 Sqrrl Data, Inc 21
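As one hedged illustration of the "repeated column names" point (the class below is hypothetical, not the talk’s code): mapping each distinct column name to a small integer id shrinks the intermediate records, and the reduce side reverses the mapping.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative dictionary encoder for repeated column names in intermediate records.
    public class ColumnDictionary {
      private final Map<String, Integer> toId = new HashMap<>();
      private final List<String> toName = new ArrayList<>();

      // Return the id for a column name, assigning a new one on first sight.
      public int encode(String columnName) {
        Integer id = toId.get(columnName);
        if (id == null) {
          id = toName.size();
          toId.put(columnName, id);
          toName.add(columnName);
        }
        return id;
      }

      // Reverse lookup used on the reduce side (or via a shared side file).
      public String decode(int id) {
        return toName.get(id);
      }

      public static void main(String[] args) {
        ColumnDictionary dict = new ColumnDictionary();
        int a = dict.encode("event_source_address");  // 0
        int b = dict.encode("event_source_address");  // 0 again: repeats cost one small int
        System.out.println(a + " " + b + " -> " + dict.decode(a));
      }
    }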
PERFORMANCE MEASUREMENT 
Optimized map code 
• Improvement: 
• Big speedup in map function 
• Twice as fast 
• Reduced intermediate inflation sped up all 
steps between map and reduce 
©2014 Sqrrl Data, Inc 22
DO TRY THIS AT HOME 
Hints for Accumulo Application Optimization 
With these steps, we achieved a 6X speedup: 
• Perform comparisons on serialized objects 
• With Map/Reduce, calculate how many merge 
steps are needed (see the sketch after this slide) 
• Avoid premature data inflation 
• Leverage compression to shift bottlenecks 
• Always consider how fast your code should run 
©2014 Sqrrl Data, Inc 23
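For the merge-step hint, here is an illustrative calculation in Java (the sizes are assumptions, and Hadoop’s real merge schedule is slightly more involved): spills are driven by the sort buffer and spill threshold, merge passes by the io.sort.factor.

    // Illustrative spill/merge estimate from MapReduce sort settings (assumed values).
    public class MergeSteps {
      public static void main(String[] args) {
        double mapOutputMB = 350;   // per-map output size (assumption)
        double ioSortMB = 100;      // mapreduce.task.io.sort.mb
        double spillPercent = 0.8;  // mapreduce.map.sort.spill.percent
        int ioSortFactor = 10;      // mapreduce.task.io.sort.factor

        // Each spill holds roughly ioSortMB * spillPercent of serialized map output.
        int spills = (int) Math.ceil(mapOutputMB / (ioSortMB * spillPercent));

        // Merge passes: each pass combines up to ioSortFactor segments into one.
        int segments = spills, passes = 0;
        while (segments > 1) {
          segments = (int) Math.ceil((double) segments / ioSortFactor);
          passes++;
        }
        System.out.println(spills + " spills, about " + passes + " merge pass(es)");
        // Sizing map input and sort memory so spills stay at or below ioSortFactor
        // keeps the merge to a single pass and avoids rereading intermediate data.
      }
    }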
SOME CURRENT ACCUMULO 
PERFORMANCE PROJECTS 
• Optimize metadata operations 
• Batch to improve throughput (ACCUMULO-2175, 
ACCUMULO-2889) 
• Remove from critical path where possible 
• Optimize write-ahead log performance 
• Maximize throughput 
• Reduce flushes 
• Parallelize WALs (ACCUMULO-1083) 
• Avoid downtime by pre-allocating 
©2014 Sqrrl Data, Inc 24
Securely explore your data 
SQRRL IS HIRING! 
QUESTIONS? 
Chris McCubbin 
Director of Data Science 
Sqrrl Data, Inc.
