Compaction and Splitting in Apache Accumulo. Billie Rinaldi, billie@hortonworks.com. October 24, 2012. © Hortonworks Inc. 2012
Compaction and Splitting in Apache Accumulo

  • Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale on commodity hardware and/or in a cloud environment. As the foundation for the next generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise.  Run through the points on left…
  • Compaction and Splitting in Apache Accumulo

    1. Compaction and Splitting in Apache Accumulo
       Billie Rinaldi, billie@hortonworks.com
       October 24, 2012
    2. What are compaction and splitting?
       • Accumulo tables are divided into non-overlapping key ranges called tablets
       • Compaction selects a set of sorted files for a single tablet and rewrites them into one file
       • Splitting divides a tablet into two tablets
    3. Tablet Overview
       • When memory fills, new sorted files are created by flushing
       • Sorted files are combined together into fewer sorted files
    4. How much data are you writing?
       • If you never compact: $O(N)$ …
       • If you always compact: $O(N^2)$ …
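To make the two extremes concrete, here is a short worked count. It assumes (a simplification that is not stated on the slide) that each of the $N$ flushes writes one unit of data, and that "always compact" means rewriting every existing file into a single file after each flush.

```latex
\begin{align*}
W_{\text{never}}(N)  &= \sum_{i=1}^{N} 1 = N = O(N) \\
W_{\text{always}}(N) &= \underbrace{N}_{\text{flushes}}
                      + \underbrace{\sum_{i=2}^{N} i}_{\text{compactions}}
                      = N + \frac{N(N+1)}{2} - 1 = O(N^2)
\end{align*}
```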
    5. Accumulo Compaction Algorithm
       • Compact a set of files when:
         size of the largest file × compaction ratio ≤ sum of the sizes of files
       • The compaction ratio is set by table.compaction.major.ratio
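The condition on this slide can be written as a one-line predicate. The sketch below is only an illustration of the inequality: the function name `should_compact` is hypothetical, and requiring at least two files is an assumption, not something the slide states.

```python
def should_compact(file_sizes, ratio):
    """Return True when the slide's condition holds for this set of files.

    Condition: size of the largest file * compaction ratio <= sum of all sizes.
    Assumption: a set with fewer than two files is never worth compacting.
    """
    if len(file_sizes) < 2:
        return False
    return max(file_sizes) * ratio <= sum(file_sizes)


if __name__ == "__main__":
    print(should_compact([1, 1, 1], 3))  # three equal files, ratio 3
    print(should_compact([3, 1, 1], 3))  # one large file dominates
```

With a ratio of 3, three equal-sized files satisfy the condition, while one file of size 3 plus two files of size 1 do not.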
    6. In Action (r = 3, N = 1, W = 1)
    7. In Action (r = 3, N = 2, W = 2)
    8. In Action (r = 3, N = 3, W = 3)
    9. In Action (r = 3, N = 3, W = 6)
    10. In Action (r = 3, N = 4, W = 7)
    11. In Action (r = 3, N = 5, W = 8)
    12. In Action (r = 3, N = 6, W = 9)
    13. In Action (r = 3, N = 6, W = 12)
    14. In Action (r = 3, N = 7, W = 13)
    15. In Action (r = 3, N = 8, W = 14)
    16. In Action (r = 3, N = 9, W = 15)
    17. In Action (r = 3, N = 9, W = 24)
    18. In Action (r = 3, N = 27, W = 90*)
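The "In Action" sequence can be reproduced with a small simulation. This is a sketch under stated assumptions, not Accumulo's implementation: every flush writes one unit-sized file, r is the compaction ratio, N counts flushes, and W counts all units written by flushes and compactions. The selection rule used here (test the whole set, and if it fails, drop the largest file and test the rest) is an assumption chosen because it matches the numbers on the slides; `simulate` is a hypothetical name.

```python
def simulate(ratio, flushes):
    """Return total units written (W) after `flushes` unit-sized flushes (N)."""
    files = []      # sizes of the tablet's sorted files
    written = 0     # W: units written by flushes and compactions
    for _ in range(flushes):
        files.append(1)         # a flush creates one new unit-sized file
        written += 1
        while True:
            files.sort(reverse=True)
            chosen = None
            # Try the full set first, then keep dropping the largest file.
            # Stopping at len(files) - 1 keeps at least two files per candidate.
            for start in range(len(files) - 1):
                candidate = files[start:]
                if candidate[0] * ratio <= sum(candidate):
                    chosen = start
                    break
            if chosen is None:
                break
            merged = sum(files[chosen:])
            files = files[:chosen] + [merged]   # rewrite the set as one file
            written += merged
    return written


if __name__ == "__main__":
    for n in (3, 6, 9, 27):
        print(n, simulate(3, n))
```

For r = 3 this yields W = 6, 12, 24 and 90 at N = 3, 6, 9 and 27, which are the post-compaction values shown on the slides.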
    19. Amount of data written
       • $W(r^k) = (k+1)\,r^k - (k-1)\,r^{k-1}$
       • Thus, $W(N) \approx O(N \log N)$
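As a check, the formula $W(r^k) = (k+1)\,r^k - (k-1)\,r^{k-1}$ can be evaluated at $r = 3$ and compared with the W values in the "In Action" slides (6, 24 and 90 at N = 3, 9 and 27). Substituting $N = r^k$ then makes the $N \log N$ growth explicit; the rearrangement in the last two lines is a derivation added here and does not appear on the slide.

```latex
\begin{align*}
W(3^1) &= 2\cdot 3 - 0\cdot 1 = 6 \\
W(3^2) &= 3\cdot 9 - 1\cdot 3 = 24 \\
W(3^3) &= 4\cdot 27 - 2\cdot 9 = 90 \\[4pt]
W(N) &= (k+1)\,N - (k-1)\,\frac{N}{r}, \qquad N = r^k,\ k = \log_r N \\
     &= \left(1 - \frac{1}{r}\right) N \log_r N + \left(1 + \frac{1}{r}\right) N
      = O(N \log N)
\end{align*}
```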
    20. HBase Compaction Algorithm
       • Compact a set of files when:
         size of the largest file ≤ sum of the sizes of smaller files × compaction ratio
       • The compaction ratio is set by hbase.hstore.compaction.ratio
    21. HBase Compaction Algorithm
       • Compact a set of files when:
         size of the largest file ≤ sum of the sizes of smaller files × compaction ratio
       • $\text{HBase ratio} = \dfrac{1}{\text{Accumulo ratio} - 1}$
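The relationship between the two ratios follows from splitting the Accumulo sum into the largest file plus the smaller files: largest × r ≤ largest + smaller is the same as largest ≤ smaller × 1/(r − 1). The sketch below checks this numerically. All function names are hypothetical, and the two rule functions model only the inequalities as written on the slides, not either system's full file-selection logic.

```python
from fractions import Fraction


def hbase_ratio_from_accumulo(accumulo_ratio):
    """Convert an Accumulo compaction ratio to the equivalent HBase ratio."""
    r = Fraction(accumulo_ratio)
    if r <= 1:
        raise ValueError("the conversion is only defined for ratios above 1")
    return 1 / (r - 1)


def accumulo_rule(sizes, ratio):
    # largest file * ratio <= sum of all file sizes
    return max(sizes) * ratio <= sum(sizes)


def hbase_rule(sizes, ratio):
    # largest file <= sum of the other (smaller) files * ratio
    largest = max(sizes)
    return largest <= (sum(sizes) - largest) * ratio


if __name__ == "__main__":
    h = hbase_ratio_from_accumulo(3)
    print(h)
    for sizes in ([1, 1, 1], [3, 1, 1], [9, 3, 3, 1, 1, 1]):
        print(sizes, accumulo_rule(sizes, 3), hbase_rule(sizes, h))
```

Exact fractions are used so that boundary cases, where the two sides of the inequality are equal, are not affected by floating-point rounding.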
    22. Other Compaction-related Properties
       • Accumulo
         – table.file.max
         – tserver.compaction.major.thread.files.open.max
         – tserver.compaction.major.delay
         – table.compaction.major.everything.idle
       • HBase
         – hbase.hstore.compactionThreshold
         – hbase.hstore.blockingStoreFiles
         – hbase.hstore.blockingWaitTime
         – hbase.hstore.compaction.min
         – hbase.hstore.compaction.max
         – hbase.hstore.compaction.min.size
         – hbase.hstore.compaction.max.size
    23. Accumulo Splitting
       • Always check to see if a split is needed before compacting
       • If it is needed, split first
       • File names stored in metadata table
       • split threshold
    24. Accumulo Splitting Process
       • Tablet closed, no new writes
       • Three writes to the metadata table
         – tablet made smaller & marked as splitting
         – new tablet added
         – original tablet's splitting marks removed
       • Tablet server swaps new tablets for old tablet in its online tablet list
       • Master informed
    25. Accumulo Splitting Recovery
       • Whenever a tablet is brought online, the tablet server checks to see if it has split marks.
       • If so, it assumes the splitting process was interrupted and finishes making changes to the metadata table.
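The three metadata writes and the recovery check can be illustrated with a toy model. Everything below is hypothetical and greatly simplified: the metadata table is a plain dict keyed by a tablet's end row, the split mark just remembers the old previous row, file names are omitted, and `crash_after` simulates an interruption. It is a sketch of the idea on the slides, not Accumulo's actual schema or code.

```python
def split_tablet(meta, end_row, split_row, crash_after=None):
    """Split the tablet ending at end_row; optionally stop after write 1 or 2."""
    old_prev = meta[end_row]["prev"]
    # Write 1: the original tablet is made smaller and marked as splitting.
    meta[end_row] = {"prev": split_row, "splitting": {"old_prev": old_prev}}
    if crash_after == 1:
        return
    # Write 2: the new tablet is added.
    meta[split_row] = {"prev": old_prev, "splitting": None}
    if crash_after == 2:
        return
    # Write 3: the original tablet's splitting marks are removed.
    meta[end_row]["splitting"] = None


def bring_online(meta, end_row):
    """On load, finish any interrupted split before serving the tablet."""
    mark = meta[end_row]["splitting"]
    if mark is not None:
        split_row = meta[end_row]["prev"]
        if split_row not in meta:           # write 2 never happened
            meta[split_row] = {"prev": mark["old_prev"], "splitting": None}
        meta[end_row]["splitting"] = None   # redo write 3
    return meta[end_row]


if __name__ == "__main__":
    for crash in (1, 2, None):
        meta = {"z": {"prev": "a", "splitting": None}}
        split_tablet(meta, "z", "m", crash_after=crash)
        bring_online(meta, "z")
        print(crash, meta)
```

Whichever write the split stops after, bringing the tablet online leaves the same two entries, which is the property the recovery slide describes.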
    26. Hortonworks Data Platform
       • Simplify deployment to get started quickly and easily
       • Monitor, manage any size cluster with familiar console and tools
       • Only platform to include data integration services to interact with any data
       • Metadata services opens the platform for integration with existing applications
       • Dependable high availability architecture
       • Tested at scale to future proof your cluster growth
       Reduce risks and cost of adoption. Lower the total cost to administer and provision. Integrate with your existing ecosystem.
    27. Hortonworks Training
       The expert source for Apache Hadoop training & certification
       • Role-based Developer and Administration training
         – Coursework built and maintained by the core Apache Hadoop development team.
         – The “right” course, with the most extensive and realistic hands-on materials
         – Provide an immersive experience into real-world Hadoop scenarios
         – Public and Private courses available
       • Comprehensive Apache Hadoop
    28. Next Steps?
       1. Download Hortonworks Data Platform: hortonworks.com/download
       2. Use the getting started guide: hortonworks.com/get-started
       3. Learn more… get support
          • hortonworks.com/training
            – Expert role based training
            – Course for admins, developers and operators
            – Certification program
            – Custom onsite options
          • Hortonworks Support: hortonworks.com/support
            – Full lifecycle technical support across four service levels
            – Delivered by Apache Hadoop Experts/Committers
            – Forward-compatible
    29. Questions? dev@accumulo.apache.org