Compaction and Splitting in Apache Accumulo
Billie Rinaldi
billie@hortonworks.com
October 24, 2012
© Hortonworks Inc. 2012
What are compaction and splitting?

• Accumulo tables are divided into non-overlapping key ranges called tablets
• Compaction selects a set of sorted files for a single tablet and rewrites them into one file
• Splitting divides a tablet into two tablets

Tablet Overview

• When memory fills, new sorted files are created by flushing
• Sorted files are combined together into fewer sorted files
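
The flush-and-merge cycle on this slide can be sketched roughly as follows. This is a simplified model, not Accumulo's implementation: a "file" is just a sorted Python list of (key, value) pairs, the function names are hypothetical, and the rule that the newest file wins for a repeated key is an assumption made for the example.

```python
import heapq

def flush(memory):
    """Turn an in-memory map into one sorted 'file' (a list of (key, value) pairs)."""
    return sorted(memory.items())

def merge_files(files):
    """Merge sorted files (oldest first) into one sorted file.

    Assumption for this sketch: when a key appears in several files,
    the value from the newest file is kept.
    """
    # Tag entries with the negated file index so that, for equal keys,
    # entries from newer files come out of the merge first.
    tagged = ([(key, -age, value) for key, value in f] for age, f in enumerate(files))
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))
    return merged

older = flush({"a": 1, "c": 3})
newer = flush({"b": 2, "c": 30})
combined = merge_files([older, newer])
```

Because every input is already sorted, the merge streams through the files once and never needs to hold them all in memory, which is what makes combining sorted files cheap to read but still costly in bytes rewritten.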




How much data are you writing?

• If you never compact – $O(N)$
• If you always compact – $O(N^2)$
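
A small sketch of where the two bounds come from, assuming every flush writes one unit of data and that "always compact" means rewriting the tablet's single file together with each newly flushed file. The function names and the unit-size flushes are assumptions for illustration.

```python
def written_never_compact(num_flushes):
    """Each flush writes one unit and nothing is ever rewritten."""
    return num_flushes

def written_always_compact(num_flushes):
    """Each flush writes one unit, then all data so far is rewritten into one file."""
    total_written = 0
    file_size = 0
    for _ in range(num_flushes):
        total_written += 1          # the flush itself
        if file_size == 0:
            file_size = 1           # first file: nothing to merge with
        else:
            file_size += 1
            total_written += file_size  # rewrite everything into one file
    return total_written
```

Under these assumptions the second function grows as N + N(N+1)/2 - 1, which is quadratic in the number of flushes, while the first grows linearly but leaves N files to read from.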

Accumulo Compaction Algorithm

• Compact a set of files when:

  size of the largest file × compaction ratio ≤ sum of the sizes of files

  compaction ratio: table.compaction.major.ratio
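
A minimal sketch of how the condition above might be used to pick files. The inequality is taken from the slide; the search order (test the whole set, then drop the largest file and retry) is an assumption that happens to reproduce the numbers on the "In Action" slides, and the function name is hypothetical.

```python
def select_files_to_compact(sizes, ratio=3):
    """Return the file sizes to compact together, or [] if no set qualifies.

    Condition from the slide: largest * ratio <= sum of the set.
    Assumed search order: start with all files, drop the largest on failure.
    """
    candidates = sorted(sizes, reverse=True)
    while len(candidates) > 1:
        if candidates[0] * ratio <= sum(candidates):
            return candidates
        candidates = candidates[1:]  # the largest file is too big; leave it out
    return []
```

For example, with a ratio of 3 the files `[3, 1, 1, 1]` do not qualify as a whole (9 > 6), but the three small files do (3 ≤ 3), so only those are rewritten.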



In Action (r = 3, N = 1, W = 1)
In Action (r = 3, N = 2, W = 2)
In Action (r = 3, N = 3, W = 3)
In Action (r = 3, N = 3, W = 6)
In Action (r = 3, N = 4, W = 7)
In Action (r = 3, N = 5, W = 8)
In Action (r = 3, N = 6, W = 9)
In Action (r = 3, N = 6, W = 12)
In Action (r = 3, N = 7, W = 13)
In Action (r = 3, N = 8, W = 14)
In Action (r = 3, N = 9, W = 15)
In Action (r = 3, N = 9, W = 24)
In Action (r = 3, N = 27, W = 90*)
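
The sequence of (N, W) values above can be reproduced with a short simulation. This is a sketch under stated assumptions: each flush writes a file of size 1, N counts units flushed, W counts every unit written by flushes and compactions, and the file-selection order (whole set first, then drop the largest) is assumed rather than taken from the slides.

```python
def select_files_to_compact(sizes, ratio):
    """Largest-first search for a set with largest * ratio <= sum (assumed order)."""
    candidates = sorted(sizes, reverse=True)
    while len(candidates) > 1:
        if candidates[0] * ratio <= sum(candidates):
            return candidates
        candidates = candidates[1:]
    return []

def simulate(num_flushes, ratio=3):
    """Return total units written (W) after num_flushes unit-sized flushes (N)."""
    files = []
    written = 0
    for _ in range(num_flushes):
        files.append(1)   # flush one unit to a new file
        written += 1
        while True:       # compact until no set of files qualifies
            chosen = select_files_to_compact(files, ratio)
            if not chosen:
                break
            for size in chosen:
                files.remove(size)
            merged = sum(chosen)
            files.append(merged)
            written += merged  # a compaction rewrites every byte it reads
    return written

totals = {n: simulate(n) for n in (3, 6, 9, 27)}
```

With r = 3 this yields W = 6, 12, 24 and 90 at N = 3, 6, 9 and 27, matching the post-compaction values on the slides.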
Amount of data written

• $W(r^k) = (k+1)\,r^k - (k-1)\,r^{k-1}$
• Thus, $W(N) \approx O(N \log N)$
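
As a check on the formula, it can be evaluated at the r = 3 values from the example slides and then rewritten in terms of N. The substitution N = r^k is the only step added here; it is a sketch of the reasoning, not a derivation taken from the slides.

```latex
% Check against the example slides (r = 3):
\[
W(3^1) = 2\cdot 3 - 0\cdot 1 = 6, \qquad
W(3^2) = 3\cdot 9 - 1\cdot 3 = 24, \qquad
W(3^3) = 4\cdot 27 - 2\cdot 9 = 90.
\]

% With N = r^k, so that k = \log_r N and r^{k-1} = N / r:
\[
W(N) = (k+1)\,N - (k-1)\,\frac{N}{r}
     = \left(1 - \frac{1}{r}\right) N \log_r N + \left(1 + \frac{1}{r}\right) N
     = O(N \log N).
\]
```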




HBase Compaction Algorithm

• Compact a set of files when:

  size of the largest file ≤ sum of the sizes of smaller files × compaction ratio

  compaction ratio: hbase.hstore.compaction.ratio



HBase Compaction Algorithm

• Compact a set of files when:

  size of the largest file ≤ sum of the sizes of smaller files × compaction ratio

• The two ratios are related by:

  $\text{HBase ratio} = \dfrac{1}{\text{Accumulo ratio} - 1}$
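
The relationship between the two ratios follows from moving the largest file out of the Accumulo sum: largest × r ≤ largest + smaller is the same as largest ≤ smaller × 1/(r − 1). The sketch below checks that equivalence on a few file sets; the function names are hypothetical and it assumes an Accumulo ratio greater than 1.

```python
def accumulo_should_compact(sizes, accumulo_ratio):
    """Accumulo condition: largest * ratio <= sum of all files in the set."""
    return max(sizes) * accumulo_ratio <= sum(sizes)

def hbase_should_compact(sizes, hbase_ratio):
    """HBase condition: largest <= (sum of the smaller files) * ratio."""
    largest = max(sizes)
    return largest <= (sum(sizes) - largest) * hbase_ratio

def hbase_ratio_from_accumulo(accumulo_ratio):
    """Convert an Accumulo ratio (> 1) to the equivalent HBase ratio."""
    return 1.0 / (accumulo_ratio - 1.0)

equivalent_ratio = hbase_ratio_from_accumulo(3)  # 0.5
for file_sizes in ([3, 1, 1, 1], [1, 1, 1], [3, 3, 1, 1, 1]):
    assert accumulo_should_compact(file_sizes, 3) == hbase_should_compact(
        file_sizes, equivalent_ratio
    )
```

So the ratio of 3 used in the example slides corresponds to an HBase ratio of 0.5.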
Other Compaction-related Properties

•Accumulo
  table.file.max
  tserver.compaction.major.thread.files.open.max
  tserver.compaction.major.delay
  table.compaction.major.everything.idle

•HBase
  hbase.hstore.compactionThreshold
  hbase.hstore.blockingStoreFiles
  hbase.hstore.blockingWaitTime
  hbase.hstore.compaction.min
  hbase.hstore.compaction.max
  hbase.hstore.compaction.min.size
  hbase.hstore.compaction.max.size
Accumulo Splitting

• Always check to see if a split is needed before compacting
• If it is needed, split first
• File names stored in metadata table
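
The ordering described above (check for a split, and only then consider compacting) might look like the sketch below. It assumes that a split is needed when the tablet's total file size exceeds a split threshold, and it reuses the largest-first ratio test as the compaction check; both the function name and those two rules are assumptions for illustration.

```python
def next_maintenance_action(file_sizes, split_threshold, ratio=3):
    """Decide what to do with a tablet: 'split', 'compact', or 'nothing'."""
    # Split check comes first (assumed rule: total size above the threshold).
    if sum(file_sizes) > split_threshold:
        return "split"
    # Only a tablet that does not need splitting is considered for compaction.
    candidates = sorted(file_sizes, reverse=True)
    while len(candidates) > 1:
        if candidates[0] * ratio <= sum(candidates):
            return "compact"
        candidates = candidates[1:]
    return "nothing"
```

Splitting first avoids rewriting a large set of files into one file that would immediately have to be divided between two tablets.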




[Figure: split threshold]
Accumulo Splitting Process

• Tablet closed, no new writes
• Three writes to the metadata table
  – tablet made smaller & marked as splitting
  – new tablet added
  – original tablet's splitting marks removed
• Tablet server swaps new tablets for old tablet in its online tablet list
• Master informed
Accumulo Splitting Recovery

• Whenever a tablet is brought online, the tablet server checks to see if it has split marks.
• If so, it assumes the splitting process was interrupted and finishes making changes to the metadata table.
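
The three metadata writes and the recovery check can be modelled with a toy in-memory table. Everything here is hypothetical: the metadata table is a plain dict, the `split_mark` field and the `_hi` naming of the new tablet are invented for the sketch, and `crash_after` simply stops the split partway to mimic an interruption.

```python
def write_1_shrink_and_mark(metadata, tablet, split_row):
    """Write 1: tablet made smaller and marked as splitting."""
    start, end = metadata[tablet]["range"]
    metadata[tablet] = {
        "range": (start, split_row),
        "split_mark": {"new_tablet": tablet + "_hi", "range": (split_row, end)},
    }

def write_2_add_new_tablet(metadata, tablet):
    """Write 2: new tablet added (safe to repeat)."""
    mark = metadata[tablet]["split_mark"]
    metadata[mark["new_tablet"]] = {"range": mark["range"], "split_mark": None}

def write_3_clear_mark(metadata, tablet):
    """Write 3: original tablet's splitting marks removed."""
    metadata[tablet]["split_mark"] = None

def split(metadata, tablet, split_row, crash_after=3):
    """Run the three writes in order, stopping after `crash_after` of them."""
    steps = [
        lambda: write_1_shrink_and_mark(metadata, tablet, split_row),
        lambda: write_2_add_new_tablet(metadata, tablet),
        lambda: write_3_clear_mark(metadata, tablet),
    ]
    for step in steps[:crash_after]:
        step()

def bring_online(metadata, tablet):
    """Recovery: if split marks are present, finish the interrupted split."""
    if metadata[tablet]["split_mark"] is not None:
        write_2_add_new_tablet(metadata, tablet)
        write_3_clear_mark(metadata, tablet)

table = {"t1": {"range": ("a", "z"), "split_mark": None}}
split(table, "t1", "m", crash_after=1)  # interrupted after the first write
bring_online(table, "t1")               # recovery completes the split
```

Because the first write records everything needed to finish the split, an interruption after any of the three writes leaves the table in a state that the recovery check can complete.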


Hortonworks Data Platform

• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future proof your cluster growth

 – Reduce risks and cost of adoption
 – Lower the total cost to administer and provision
 – Integrate with your existing ecosystem

Hortonworks Training

The expert source for Apache Hadoop training & certification

Role-based Developer and Administration training
 – Coursework built and maintained by the core Apache Hadoop development team.
 – The “right” course, with the most extensive and realistic hands-on materials
 – Provide an immersive experience into real-world Hadoop scenarios
 – Public and Private courses available

Comprehensive Apache Hadoop

Next Steps?

1   Download Hortonworks Data Platform
    hortonworks.com/download

2   Use the getting started guide
    hortonworks.com/get-started

3   Learn more… get support

    • Expert role based training
    • Course for admins, developers and operators
    • Certification program
    • Custom onsite options
    hortonworks.com/training

    Hortonworks Support
    • Full lifecycle technical support across four service levels
    • Delivered by Apache Hadoop Experts/Committers
    • Forward-compatible
    hortonworks.com/support

Questions?
dev@accumulo.apache.org






Editor's Notes

  • #27 Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale on commodity hardware and/or in a cloud environment. As the foundation for the next generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise. Run through the points on left…