
Large Scale Accumulo Clusters

Discussion of how to design apps for scaling on Accumulo clusters from 10 to 10,000 machines.

  1. 1. Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova, Accumulo Summit, June 2014
  2. 2. Scale, Security, Schema
  3. 3. Scale
  4. 4. to scale (1) - (vt) to change the size of something
  5. 5. “let’s scale the cluster up to twice the original size”
  6. 6. to scale (2) - (vi) to function properly at a large scale
  7. 7. “Accumulo scales”
  8. 8. What is Large Scale?
  9. 9. Notebook Computer • 16 GB DRAM • 512 GB Flash Storage • 2.3 GHz quad-core i7 CPU
  10. 10. Modern Server • 100s of GB DRAM • 10s of TB on disk • 10s of cores
  11. 11. Large Scale [Chart: data held in RAM and on disk, on a log scale from 10 GB to 100 PB, for a laptop, a single server, and 10-, 100-, 1000-, and 10,000-node clusters]
  12. 12. Data Composition [Chart: data volume by month, January through April, broken into original raw data, derivative QFDs, and indexes]
  13. 13. Accumulo Scales • From GB to PB, Accumulo keeps two things low: • Administrative effort • Scan latency
  14. 14. Scan Latency [Chart: scan latency (0-0.05 s) versus cluster size (0-1000 nodes); latency stays low as the cluster grows]
  15. 15. Administrative Overhead [Chart: failed machines and admin interventions (0-12) versus cluster size (0-1000 nodes)]
  16. 16. Accumulo Scales • From GB to PB three things grow linearly: • Total storage size • Ingest Rate • Concurrent scans
  17. 17. Ingest Benchmark [Chart: ingest rate in millions of entries per second (0-100) versus cluster size (0-1000 nodes)]
  18. 18. AWB Benchmark http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf
  19. 19. 1000 machines
  20. 20. 100 M entries written per second
  21. 21. 408 terabytes
  22. 22. 7.56 trillion total entries
  23. 23. Graph Benchmark http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
  24. 24. 1200 machines
  25. 25. 4.4 trillion vertices
  26. 26. 70.4 trillion edges
  27. 27. 149 M edges traversed per second
  28. 28. 1 petabyte
  29. 29. Graph Analysis, billions of edges: Twitter 1.5, Yahoo! 6.6, Facebook 1,000, Accumulo 70,000
  30. 30. Accumulo is designed after Google’s BigTable
  31. 31. BigTable powers hundreds of applications at Google
  32. 32. BigTable serves 2+ exabytes http://hbasecon.com/sessions/#session33
  33. 33. 600 M queries per second organization wide
  34. 34. From 10 to 10,000
  35. 35. Starting with ten machines (10^1)
  36. 36. One rack
  37. 37. 1 TB RAM
  38. 38. 10-100 TB Disk
  39. 39. Hardware failures rare
  40. 40. Test Application Designs
  41. 41. Designing Applications for Scale
  42. 42. Keys to Scaling 1. Live writes go to all servers 2. User requests are satisfied by few scans 3. Updates are turned into inserts
  43. 43. Keys to Scaling Writes on all servers Few Scans
  44. 44. Hash / UUID Keys - uniform writes.
      Original (Key -> Value): userA:name=Bob, userA:age=43, userA:account=$30, userB:name=Annie, userB:age=32, userB:account=$25, userC:name=Joe, userC:age=59
      Hashed (RowID -> Value): af362de4=Bob, b23dc4be=Annie, b98de2ff=Joe, c48e2ade=$30, c7e43fb2=$25, d938ff3d=32, e2e4dac4=59, e98f2eab3=43
  45. 45. Monitor - participating Tablet Servers for table MyTable:
      Server | Hosted Tablets | Ingest
      r1n1   | 1500           | 200k
      r1n2   | 1501           | 210k
      r2n1   | 1499           | 190k
      r2n2   | 1500           | 200k
  46. 46. Hash / UUID Keys (same table as slide 44): get(userA) takes 3 x 1-entry scans on 3 servers
  47. 47. Keys to Scaling Writes on all servers Few Scans Hash / UUID Keys
  48. 48. Group for Locality - still fairly uniform writes.
      Original (Key -> Value): userA:name=Bob, userA:age=43, userA:account=$30, userB:name=Annie, userB:age=32, userB:account=$25, userC:name=Joe, userC:age=59
      Grouped (RowID: Col=Value): af362de4: name=Annie, age=32, account=$25; c48e2ade: name=Joe, age=59; e2e4dac4: name=Bob, age=43, account=$30
  49. 49. Group for Locality (same table as slide 48): get(userA) is 1 x 3-entry scan on 1 server (see the sketch below)
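
  A minimal sketch of the hashed, grouped layout above, assuming a hypothetical "users" table, an "attr" column family, and an MD5-based row ID (the table name, column names, and hash choice are illustrative, not from the talk):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class GroupedUserKeys {

      // Hash the natural key so row IDs, and therefore writes, spread uniformly across tablets.
      static String rowId(String userId) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(userId.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4; i++)
          sb.append(String.format("%02x", d[i]));
        return sb.toString(); // e.g. "e2e4dac4"
      }

      // All attributes of one user share a row, so a lookup is one short scan on one server.
      static void writeUser(Connector conn, String user, String name, String age, String account) throws Exception {
        BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());
        Mutation m = new Mutation(new Text(rowId(user)));
        m.put(new Text("attr"), new Text("name"), new Value(name.getBytes(StandardCharsets.UTF_8)));
        m.put(new Text("attr"), new Text("age"), new Value(age.getBytes(StandardCharsets.UTF_8)));
        m.put(new Text("attr"), new Text("account"), new Value(account.getBytes(StandardCharsets.UTF_8)));
        bw.addMutation(m);
        bw.close();
      }

      // get(userA): a single 3-entry scan on a single server.
      static void printUser(Connector conn, String user) throws Exception {
        Scanner s = conn.createScanner("users", new Authorizations());
        s.setRange(Range.exact(new Text(rowId(user))));
        for (Entry<Key,Value> e : s)
          System.out.println(e.getKey().getColumnQualifier() + " = " + e.getValue());
      }
    }
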
  50. 50. Keys to Scaling Writes on all servers Few Scans Grouped Keys
  51. 51. Temporal Keys (RowID -> Value): 20140101=44, 20140102=22, 20140103=23
  52. 52. Temporal Keys: two more days arrive, 20140104=25 and 20140105=31
  53. 53. Temporal Keys: then 20140106=27, 20140107=25, 20140108=17. Always write to one server
  54. 54. No write parallelism
  55. 55. Temporal Keys (same table as slide 53): fetching ranges uses few scans, e.g. get(20140101 to 201404)
  56. 56. Keys to Scaling Writes on all servers Few Scans Temporal Keys
  57. 57. Binned Temporal Keys (RowID -> Value): 0_20140101=44, 1_20140102=22, 2_20140103=23. Uniform writes
  58. 58. Binned Temporal Keys: 0_20140101=44, 0_20140104=25, 1_20140102=22, 1_20140105=31, 2_20140103=23, 2_20140106=27. Uniform writes
  59. 59. Binned Temporal Keys: 0_20140101=44, 0_20140104=25, 0_20140107=25, 1_20140102=22, 1_20140105=31, 1_20140108=17, 2_20140103=23, 2_20140106=27. Uniform writes
  60. 60. Binned Temporal Keys (same table as slide 59): one scan per bin for get(20140101 to 201404) (see the sketch below)
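
  A sketch of binned writes and the per-bin scan, assuming a hypothetical "metrics" table and three bins (any deterministic bin assignment works; the slides cycle consecutive days across bins, and in practice you would use at least as many bins as tablet servers):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class BinnedTemporalKeys {

      static final int NUM_BINS = 3; // illustrative; use roughly >= the number of tablet servers

      // Prefix each date row with a deterministic bin so concurrent writes hit many tablets.
      static String binnedRow(String yyyymmdd) {
        int bin = Math.abs(yyyymmdd.hashCode()) % NUM_BINS;
        return bin + "_" + yyyymmdd; // e.g. "2_20140103"
      }

      static void write(Connector conn, String yyyymmdd, long value) throws Exception {
        BatchWriter bw = conn.createBatchWriter("metrics", new BatchWriterConfig());
        Mutation m = new Mutation(new Text(binnedRow(yyyymmdd)));
        m.put(new Text("stat"), new Text("value"), new Value(Long.toString(value).getBytes()));
        bw.addMutation(m);
        bw.close();
      }

      // One range per bin, executed in parallel by a BatchScanner: few scans per request.
      static void scan(Connector conn, String startDay, String endDay) throws Exception {
        List<Range> ranges = new ArrayList<Range>();
        for (int bin = 0; bin < NUM_BINS; bin++)
          ranges.add(new Range(bin + "_" + startDay, bin + "_" + endDay));
        BatchScanner bs = conn.createBatchScanner("metrics", new Authorizations(), NUM_BINS);
        bs.setRanges(ranges);
        for (Entry<Key,Value> e : bs)
          System.out.println(e.getKey().getRow() + " = " + e.getValue());
        bs.close();
      }
    }
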
  61. 61. Keys to Scaling Writes on all servers Few Scans Binned Temporal Keys
  62. 62. Keys to Scaling • Key design is critical • Group data under common row IDs to reduce scans • Prepend bins to row IDs to increase write parallelism
  63. 63. Splits • Pre-split, or let splits happen organically • Going from dev to production, you can ingest a representative sample, obtain its split points, and use them to pre-split the larger system (see the sketch below) • Hundreds or thousands of tablets per server are OK • Want at least one tablet per server
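
  A sketch of capturing organic split points from a smaller (dev) instance and pre-splitting the production table with them; connection setup is omitted and names are illustrative:

    import java.util.SortedSet;
    import java.util.TreeSet;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    public class PreSplit {

      // Collect the split points that formed organically while ingesting a representative sample.
      static SortedSet<Text> captureSplits(Connector dev, String table) throws Exception {
        return new TreeSet<Text>(dev.tableOperations().listSplits(table));
      }

      // Apply them to the (still empty) production table so ingest hits many tablets immediately.
      static void applySplits(Connector prod, String table, SortedSet<Text> splits) throws Exception {
        prod.tableOperations().addSplits(table, splits);
      }
    }
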
  64. 64. Effect of Compression • Similar sorted keys compress well • May need more data than you think to auto-split
  65. 65. Inserts are fast: 10s of thousands per second per machine (see the sketch below)
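
  High insert rates depend on client-side batching; a minimal BatchWriter configuration sketch (the buffer size, latency, and thread count are illustrative, not recommendations from the talk):

    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;

    public class FastIngest {
      // Client-side batching is what makes tens of thousands of inserts per second per
      // server practical: mutations are buffered and shipped in large, parallel batches.
      static BatchWriter writer(Connector conn, String table) throws Exception {
        BatchWriterConfig cfg = new BatchWriterConfig()
            .setMaxMemory(64 * 1024 * 1024)      // buffer up to 64 MB of mutations
            .setMaxLatency(2, TimeUnit.SECONDS)  // flush at least every 2 seconds
            .setMaxWriteThreads(8);              // parallel sends to tablet servers
        return conn.createBatchWriter(table, cfg);
      }
    }
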
  66. 66. Updates *can* be …
  67. 67. Update Types • Overwrite • Combine • Complex
  68. 68. Update - Overwrite • Performance same as insert • Ignore (don’t read) existing value • Accumulo’s Versioning Iterator does the overwrite
  69. 69. Update - Overwrite: userB:age -> 34 arrives; the table currently holds af362de4 (name=Annie, age=32, account=$25), c48e2ade (name=Joe, age=59), e2e4dac4 (name=Bob, age=43, account=$30)
  70. 70. Update - Overwrite: after the write, af362de4 has age=34; everything else is unchanged (a minimal sketch follows)
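
  Since the overwrite never reads the old value, the client code is just another insert. A minimal sketch, reusing the hypothetical "users" table and "attr" column family from the earlier sketch:

    import java.nio.charset.StandardCharsets;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class OverwriteUpdate {
      // No read is needed: write the new value under the same key. The table's default
      // VersioningIterator (maxVersions = 1) hides the old value at scan time and drops
      // it during compaction.
      static void setAge(Connector conn, String rowId, int newAge) throws Exception {
        BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());
        Mutation m = new Mutation(new Text(rowId));
        m.put(new Text("attr"), new Text("age"),
            new Value(Integer.toString(newAge).getBytes(StandardCharsets.UTF_8)));
        bw.addMutation(m);
        bw.close();
      }
    }
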
  71. 71. Update - Combine • Things like X = X + 1 • Normally one would have to read the old value to do this, but Accumulo Iterators allow multiple inserts to be combined at scan time or at compaction time • Performance is the same as inserts
  72. 72. Update - Combine: userB:account -> +10 arrives; af362de4 currently has account=$25
  73. 73. Update - Combine: the increment is simply inserted, so af362de4 now holds both account=$25 and account=$10
  74. 74. Update - Combine: getAccount(userB) returns $35 (the combiner sums the two entries at scan time)
  75. 75. Update - Combine: after compaction the table stores a single entry, af362de4 account=$35 (configuration sketch below)
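
  A sketch of configuring this behavior with a SummingCombiner on the account column, assuming the balances are stored as plain long strings rather than the "$25"-style values shown above; the iterator name and priority are illustrative:

    import java.util.Collections;
    import java.util.EnumSet;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class AccountCombiner {
      // Attach a SummingCombiner to the "account" column so each "+N" insert is folded
      // into the running balance at scan time and at compaction time.
      static void attach(Connector conn, String table) throws Exception {
        IteratorSetting is = new IteratorSetting(10, "accountSum", SummingCombiner.class);
        SummingCombiner.setColumns(is, Collections.singletonList(new IteratorSetting.Column("account")));
        SummingCombiner.setEncodingType(is, LongCombiner.Type.STRING);
        conn.tableOperations().attachIterator(table, is, EnumSet.allOf(IteratorScope.class));
      }
    }
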
  76. 76. Update - Complex • Some updates require looking at more data than Iterators have access to - such as multiple rows • These require reading the data out in order to write the new value • Performance will be much slower
  77. 77. Update - Complex: userC:account = getBalance(userA) + getBalance(userB); read $35 (userB) and $30 (userA), compute 35 + 30 = 65; userC (c48e2ade) currently has account=$40
  78. 78. Update - Complex: write the result back: c48e2ade now has account=$65
  79. 79. Planning a Larger-Scale Cluster (10^2 - 10^4)
  80. 80. Storage vs Ingest [Chart: storage in terabytes (for 1x1TB and 12x3TB disk configurations) and ingest rate in millions of entries per second, versus cluster size from 10 to 10,000 machines]
  81. 81. Model for Ingest Rates: A = 0.85^(log2 N) * N * S, where N is the number of machines, S is single-server throughput (entries/second), and A is aggregate cluster throughput (entries/second). Expect an 85% increase in write rate when doubling the size of the cluster
  82. 82. Estimating Machines Required: N = 2^(log2(A / S) / 0.7655347), where S is single-server throughput (entries/second) and A is the target aggregate throughput (entries/second). Expect an 85% increase in write rate when doubling the size of the cluster (worked example below)
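
  The two formulas are inverses of each other, since 0.7655347 = 1 + log2(0.85). A worked sketch; the per-server rate S is an assumed figure, not one given in the talk:

    public class IngestModel {

      // A = 0.85^(log2 N) * N * S : aggregate write rate of an N-server cluster,
      // given single-server throughput S (entries/second).
      static double aggregateRate(int n, double s) {
        return Math.pow(0.85, Math.log(n) / Math.log(2)) * n * s;
      }

      // N = 2^(log2(A/S) / 0.7655347) : machines needed for a target aggregate rate A.
      // 0.7655347 = 1 + log2(0.85), so this is simply the formula above solved for N.
      static double machinesNeeded(double a, double s) {
        return Math.pow(2, (Math.log(a / s) / Math.log(2)) / 0.7655347);
      }

      public static void main(String[] args) {
        double s = 500_000; // assumed per-server rate (entries/second)
        System.out.println(aggregateRate(1000, s)); // ~9.9e7, in line with the AWB result above
        System.out.println(machinesNeeded(1e8, s)); // ~1000 machines for 100 M entries/second
      }
    }
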
  83. 83. Predicted Cluster Sizes [Chart: predicted number of machines (0-12,000) versus target millions of entries per second (0-600)]
  84. 84. 100 Machines (10^2)
  85. 85. Multiple racks
  86. 86. 10 TB RAM
  87. 87. 100 TB - 1PB Disk
  88. 88. Some hardware failures in the first week (burn in)
  89. 89. Expect 3 failed HDs in first 3 mo
  90. 90. Another 4 within the first year http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf
  91. 91. Can process the 1000 Genomes data set 260 TB www.1000genomes.org
  92. 92. Can store and index the Common Crawl Corpus: 2.8 billion web pages, 541 TB commoncrawl.org
  93. 93. One year of Twitter: 182 billion tweets, 483 TB http://www.sec.gov/Archives/edgar/data/1418091/000119312513390321/d564001ds1.htm
  94. 94. Deploying an Application [Diagram: Users -> Clients -> Tablet Servers]
  95. 95. May not see the effect of writing to disk for a while
  96. 96. 1000 machines (10^3)
  97. 97. Multiple rows of racks
  98. 98. 100 TB RAM
  99. 99. 1-10 PB Disk
  100. 100. Hardware failure is a regular occurrence
  101. 101. Hard drive failure about every 5 days on average, skewed towards the beginning of the year
  102. 102. Can traverse the ‘brain graph’ 70 trillion edges, 1 PB http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
  103. 103. Facebook Graph 1s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
  104. 104. Netflix Video Master Copies 3.14 PB http://www.businessweek.com/articles/2013-05-09/netflix-reed-hastings-survive-missteps-to-join-silicon-valleys-elite
  105. 105. World of Warcraft Backend Storage 1.3 PB http://www.datacenterknowledge.com/archives/2009/11/25/wows-back-end-10-data-centers-75000-cores/
  106. 106. Webpages, live on the Internet 14.3 Trillion http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
  107. 107. Things like the difference between two compression algorithms start to make a big difference
  108. 108. Use range compactions to effect changes on portions of a table (see the sketch below)
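
  A sketch of issuing a range compaction through the Java API (table name and row bounds are illustrative):

    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    public class RangeCompaction {
      // Compact only the tablets covering [startRow, endRow] instead of the whole table.
      // flush=true writes out in-memory data first; wait=false returns without blocking.
      static void compactRange(Connector conn, String table, String startRow, String endRow) throws Exception {
        conn.tableOperations().compact(table, new Text(startRow), new Text(endRow), true, false);
      }
    }
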
  109. 109. Lay off ZooKeeper
  110. 110. Watch Garbage Collector and NameNode ops
  111. 111. Garbage Collection > 5 minutes?
  112. 112. Start thinking about NameNode Federation
  113. 113. Accumulo 1.6
  114. 114. Multiple NameNodes [Diagram: Accumulo over multiple HDFS clusters, each with its own NameNode and DataNodes]
  115. 115. Multiple NameNodes [Diagram: Accumulo over multiple NameNodes with shared DataNodes - HDFS Federation, requires Hadoop 2.0]
  116. 116. More NameNodes = higher risk of one going down. Can use HA NameNodes in conjunction with Federation (config sketch below)
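
  In Accumulo 1.6, the HDFS locations an instance writes to are listed in the instance.volumes property of accumulo-site.xml. A minimal sketch with hypothetical NameNode hostnames:

    <!-- accumulo-site.xml: spread tablet data across several HDFS namespaces -->
    <property>
      <name>instance.volumes</name>
      <value>hdfs://namenode1:8020/accumulo,hdfs://namenode2:8020/accumulo</value>
    </property>
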
  117. 117. 10,000 machines (10^4)
  118. 118. You, my friend, are here to kick a** and chew bubble gum
  119. 119. 1 PB RAM
  120. 120. 10-100 PB Disk
  121. 121. 1 hardware failure every hour on average
  122. 122. Entire Internet Archive 15 PB http://www.motherjones.com/media/2014/05/internet-archive-wayback-machine-brewster-kahle
  123. 123. A year’s worth of data from the Large Hadron Collider 15 PB http://home.web.cern.ch/about/computing
  124. 124. 0.1% of all Internet traffic in 2013 43.6 PB http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
  125. 125. Facebook Messaging Data 10s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
  126. 126. Facebook Photos 240 billion, high 10s of PB http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
  127. 127. Must use multiple NameNodes
  128. 128. Can tune back heartbeats, periodicity of central processes in general
  129. 129. Can combine multiple PB data sets
  130. 130. Up to 10 quadrillion entries in a single table
  131. 131. While maintaining sub-second lookup times
  132. 132. Only with Accumulo 1.6
  133. 133. Dealing with data over time
  134. 134. Data Over Time - Patterns • Initial Load • Increasing Velocity • Focus on Recency • Historical Summaries
  135. 135. Initial Load • Get a pile of old data into Accumulo fast • Latency not important (data is old) • Throughput critical
  136. 136. Bulk Load RFiles
  137. 137. Bulk Loading [Diagram: MapReduce -> RFiles -> Accumulo] (see the sketch below)
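
  A sketch of the final hand-off, assuming a MapReduce job (for example one writing through AccumuloFileOutputFormat) has already produced sorted RFiles in a staging directory; the paths are illustrative:

    import org.apache.accumulo.core.client.Connector;

    public class BulkLoad {
      // Hand a directory of sorted RFiles directly to the tablet servers. Files that
      // cannot be assigned are moved to the failures directory (which should already
      // exist and be empty) for inspection and re-import.
      static void importRFiles(Connector conn, String table) throws Exception {
        conn.tableOperations().importDirectory(table, "/bulk/ingest", "/bulk/failures", false);
      }
    }
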
  138. 138. Increasing velocity
  139. 139. If your data isn’t big today, wait a little while
  140. 140. Accumulo scales up dynamically, online. No downtime
  141. 141. The first sense of ‘scale’: the cluster can change size
  142. 142. Scaling Up [Diagram: Clients / Accumulo / HDFS]: 3 physical servers, each running a Tablet Server process and a DataNode process
  143. 143. Scaling Up: start 3 new Tablet Server processes and 3 new DataNode processes
  144. 144. Scaling Up: the master immediately assigns tablets to the new Tablet Servers
  145. 145. Scaling Up: clients immediately begin querying the new Tablet Servers
  146. 146. Scaling Up: the new Tablet Servers read data from the old DataNodes
  147. 147. Scaling Up: the new Tablet Servers write data to the new DataNodes
  148. 148. Never really seen anyone do this
  149. 149. Except myself
  150. 150. 20 machines in Amazon EC2
  151. 151. to 400 machines
  152. 152. all during the same MapReduce job reading data out of Accumulo, summarizing, and writing back
  153. 153. Scaled back down to 20 machines when done
  154. 154. Just killed Tablet Servers
  155. 155. Decommissioned Data Nodes for safe data consolidation to remaining 20 nodes
  156. 156. Other ways to go from 10^x to 10^(x+1)
  157. 157. Accumulo Table Export
  158. 158. followed by an HDFS DistCp to the new cluster (see the sketch below)
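
  A sketch of that sequence using the table export/import API; the DistCp step itself runs outside Accumulo, and the directory names are illustrative:

    import org.apache.accumulo.core.client.Connector;

    public class TableMigration {
      // Rough sequence for copying a table to a new cluster:
      //   1. take the table offline so its set of files stops changing,
      //   2. export its metadata plus a list of files to copy,
      //   3. run 'hadoop distcp -f <exportDir>/distcp.txt <destination>' out of band,
      //   4. import the copy on the destination cluster.
      static void exportForCopy(Connector src, String table, String exportDir) throws Exception {
        src.tableOperations().offline(table);
        src.tableOperations().exportTable(table, exportDir);
      }

      static void importCopy(Connector dst, String table, String importDir) throws Exception {
        dst.tableOperations().importTable(table, importDir);
      }
    }
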
  159. 159. Maybe new replication feature
  160. 160. Newer Data is Read more Often
  161. 161. Accumulo keeps newly written data in memory
  162. 162. Block Cache can keep recently queried data in memory
  163. 163. Combining Iterators make maintaining summaries of large amounts of raw events easy
  164. 164. Reduces storage burden
  165. 165. Historical Summaries [Chart: April through July, unique entities stored versus raw events processed (0-8000)]
  166. 166. Age-off iterator can automatically remove data over a certain age
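
  A sketch of attaching the age-off filter to a table; the priority and iterator name are illustrative, and ttl is given in milliseconds (e.g. 30 days = 2592000000L):

    import java.util.EnumSet;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
    import org.apache.accumulo.core.iterators.user.AgeOffFilter;

    public class AgeOff {
      // Filter out entries older than ttl milliseconds at scan and compaction time, so
      // expired data disappears without an explicit delete pass.
      static void attach(Connector conn, String table, long ttlMillis) throws Exception {
        IteratorSetting is = new IteratorSetting(15, "ageoff", AgeOffFilter.class);
        is.addOption("ttl", Long.toString(ttlMillis));
        conn.tableOperations().attachIterator(table, is, EnumSet.allOf(IteratorScope.class));
      }
    }
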
  167. 167. IBM estimates 2.5 exabytes of data is created every day http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
  168. 168. 90% of available data created in last 2 years http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
  169. 169. 25 new 10k node Accumulo clusters per day
  170. 170. Accumulo is doing its part to get in front of the big data trend
  171. 171. Questions ?
  172. 172. @aaroncordova
