© Hortonworks Inc. 2011
Compaction Improvements in Apache HBase
Sergey Shelukhin
sergey@hortonworks.com
About me
•HBase committer since February 2013
•Member of Technical Staff at Hortonworks
•Twitter @sershe84
Architecting the Future of Big Data
Overview
•What are compactions?
•Default algorithm and improvements
•Enabling different implementations
•Algorithms for various scenarios
•Conclusions
What are compactions?
•HBase writes out immutable files as data is added
–Each Store (CF+region) consists of these rowkey-ordered files
–Immutable => more files accumulate over time
–More files => slower reads
•Compaction rewrites several files into one
–Fewer files => faster reads
• Major compaction rewrites all files in a Store into one
–Can drop deleted records, tombstones and old versions
•In minor compaction, files to compact are selected based on a heuristic
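A minimal sketch of what a compaction does to the data, assuming each file is a rowkey-sorted list of (rowkey, value) pairs with unique rowkeys per file, and using `None` as a stand-in for a delete marker. The function name and data layout are illustrative assumptions, not HBase's on-disk format or API.

```python
import heapq

TOMBSTONE = None  # stand-in for a delete marker in this sketch


def compact(files, major=False):
    """Merge rowkey-ordered files (listed oldest first) into one sorted file.

    Each file is a list of (rowkey, value) pairs sorted by rowkey.
    When a rowkey appears in several files, the newest file wins.
    """
    # Tag entries with the negated file age so that, for equal rowkeys,
    # the entry from the newer file sorts first.
    streams = [
        [(key, -age, value) for key, value in f]
        for age, f in enumerate(files)
    ]
    merged = []
    last_key = object()  # sentinel that equals no rowkey
    for key, _, value in heapq.merge(*streams):
        if key == last_key:
            continue  # older version of a rowkey already handled
        last_key = key
        if major and value is TOMBSTONE:
            continue  # every file was rewritten, so the delete can be dropped
        merged.append((key, value))
    return merged
```

A minor compaction has to keep the delete marker, because an older file outside the selection may still hold the deleted row; only when every file in the Store is rewritten can the marker go.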
Compactions example
•Memstore fills up, files are flushed
•When enough files accumulate, they are compacted
[Figure: writes enter the MemStore; flushes produce HFiles on HDFS]
Reads slow down w/o compactions
•If too many files accumulate, reads slow down
•Read latency over time without compactions:
[Figure: read latency (ms), 0-25, vs. load test time (sec), 0-14400]
But compactions cause slowdowns
•Looks like lots of I/O for no apparent benefit
•Example effect on reads (note better average)
[Figure: read latency (ms), 0-25, vs. load test time (sec), 0-10800]
Default algorithm and improvements
Compaction tradeoffs
•HBase resolves key conflicts by file age
–Therefore, can only compact contiguous files
•Large compactions are more efficient (less total I/O)
–However, they can cause long slowdowns for clients
•Small compactions have less effect on clients
–However, in total you do more rewriting
•We want to compact files of similar size
Default algorithm in 0.94
•Ratio-based selection
–Look for files at most F times larger than the following files
–Also allows limiting file numbers and sizes
•Higher ratio => more aggressive (default 1.2)
•Example: 2 files minimum, 3 maximum, ratio 1.2
[Figure: five HFiles with candidate selections labeled "Too big!", "Too many files!" and "OK."]
•Usually good for typical accumulation of flushed files
•Not good for bulk load – unpredictable file sizes!
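A rough sketch of ratio-based selection, using the example settings above as defaults (2 files minimum, 3 maximum, ratio 1.2). It assumes "the following files" means the sum of the sizes of the newer files that could join the compaction; that reading, the function name, and the omission of size limits are simplifications made here, not the exact 0.94 code.

```python
def select_by_ratio(sizes, ratio=1.2, min_files=2, max_files=3):
    """Return (start, end) indices of the files to compact, or None.

    `sizes` lists file sizes from oldest to newest.
    """
    n = len(sizes)
    start = 0
    while n - start >= min_files:
        # Newer files that could be compacted together with sizes[start].
        window = sizes[start + 1 : start + max_files]
        if sizes[start] <= ratio * sum(window):
            break  # within the ratio: the first valid selection is taken
        start += 1  # too big relative to what follows, skip it
    if n - start < min_files:
        return None
    return start, min(n, start + max_files)
```

For sizes `[100, 50, 23, 12, 12]` the two oldest files are skipped as too big and the last three are selected. Only the first file of the selection is checked against the ratio.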
Off-peak compactions
•Good if you have variable load through the day
•HBASE-4463 - present in 0.94 (since 2011)
•Compact more aggressively during certain hours of the day, when load is lower
•Set off-peak period via
– hbase.offpeak.start.hour, hbase.offpeak.end.hour (0-23)
•Then, set ratio via
– hbase.hstore.compaction.ratio.offpeak (default is 5)
•Only one "off-peak" compaction at a time, so load is not totally prohibitive
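To illustrate how the two hour settings and the off-peak ratio interact, here is a small sketch. The half-open window and the wrap-past-midnight handling are assumptions made for illustration, and the function names are hypothetical; check the HBase documentation for the exact boundary semantics.

```python
def is_off_peak(hour, start_hour, end_hour):
    """True if `hour` (0-23) falls in the window [start_hour, end_hour)."""
    if start_hour <= end_hour:
        return start_hour <= hour < end_hour
    # Window wraps past midnight, e.g. start 22, end 6.
    return hour >= start_hour or hour < end_hour


def compaction_ratio(hour, start_hour, end_hour,
                     normal_ratio=1.2, off_peak_ratio=5.0):
    """Pick the selection ratio to use for a compaction starting at `hour`."""
    if is_off_peak(hour, start_hour, end_hour):
        return off_peak_ratio
    return normal_ratio
```

With a window of 22 to 6, a compaction selected at 03:00 would use the off-peak ratio of 5, while one at noon would use the regular 1.2.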
Inefficiencies in default algorithm
•First valid selection is chosen
•Ratio is only considered for the first selected file
–Thus, other files in compaction may not be similar
•The solution found may not be the best one
–especially for bulk load, with unpredictable file sizes
[Figure: a row of HFiles; the chosen selection is labeled "Matches the ratio, but this is a bad selection"]
Exploring compaction selection
•There are usually not that many files, so looking at all valid permutations and comparing their quality is viable
•HBASE-7842 - "exploring" compaction selection
–Ratio checked for each file to choose good permutations
–When store is ok, try to compact the most files
–When store has too many files, try to eliminate some as fast as possible
•On by default in 0.95/0.96
•Works with your old configuration settings
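The "store is ok" case can be sketched as a brute-force search over contiguous selections. This is a simplified illustration under assumptions: every file must be within the ratio of the sum of the other files in the selection, more files are preferred, and ties go to the selection that rewrites less data. The "too many files" mode is left out, and the names are hypothetical.

```python
def in_ratio(sizes, ratio):
    """Every file must be at most `ratio` times the sum of the others."""
    total = sum(sizes)
    return all(s <= ratio * (total - s) for s in sizes)


def explore(sizes, ratio=1.2, min_files=3, max_files=10):
    """Return (start, end) of the best contiguous selection, or None.

    `sizes` lists file sizes from oldest to newest.
    """
    best, best_key = None, None
    n = len(sizes)
    for start in range(n):
        for end in range(start + min_files, min(n, start + max_files) + 1):
            candidate = sizes[start:end]
            if not in_ratio(candidate, ratio):
                continue
            # Prefer more files; break ties by less data to rewrite.
            key = (len(candidate), -sum(candidate))
            if best_key is None or key > best_key:
                best, best_key = (start, end), key
    return best
```

For sizes `[1000, 10, 12, 11, 200]` this picks the three small files in the middle and leaves the two large ones alone.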
Examples and results
•In the previous example:
[Figure: the same HFiles with two candidate selections, labeled "Not in ratio, dissimilar files" and "In ratio, may be valid… But this has more files!"]
•On bulk loads of random size, depending on settings:
–Loses only 0-10% efficiency in reducing file count
–While reducing I/O 3-10 times
•Best results with ratio 1.3-1.4, 4 minimum files
Enabling different implementations
Making compactions pluggable
•To allow further improvements, the code should be easy to replace; not the case as of 0.94
•Initial implementation (part of HBASE-7055, HBASE-7516): make just the selection pluggable
•This is called "policy" (CompactionPolicy)
•Example usages
–exploring selection, mentioned previously
–tier-based selection (port from Facebook)
Making compactions more pluggable
• Other potential improvements are more involved
• Need to change other things (HBASE-7678)
• The meta-structure of the files (StoreFileManager, HBASE-7603)
–Group files by some key/time/… based scheme
–In memory/metadata only - filesystem structure or file format changes would be a compatibility nightmare
–Example – LevelDB-style compactions, stripes
• Compactor to compact the files (Compactor)
–Example – large object store, levels, stripes
• Can replace parts together or separately (StoreEngine)
–E.g. level compactor only makes sense with level-aware store
Enabling compaction tuning
•Different tables (or even column families) have different data and access patterns
•Compactions already have a large number of knobs
•Starting with 0.96, they can be configured at the table/CF level (HBASE-7236)
•Example from the shell:
alter 'table1', CONFIGURATION => {'hbase.hstore.engine.class' =>
'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', ... }
Algorithms for various scenarios
Key ways to improve compactions
• Read from fewer files
–Separate files by row key, version, time, etc.
–Allows large number of files to be present, uncompacted
• Don't compact the data you don't need to compact
–For example, old data in OpenTSDB-like systems
–Obviously, results in less I/O
• Make compactions smaller
–Without too much I/O amplification or too many files
–Results in less compaction-related outages
• HBase works better with few large regions; however, large compactions cause unavailability
How to avoid large compactions
•LevelDB compactions
–Files live on multiple levels
–Files on each level have non-overlapping row-key ranges
–…except level 0 (L0), where memstore flushes go
–Compact overlapping subsets of 2 levels; data goes up a level
–Most read requests need only one file per level, plus all of L0
•Small compactions, few files per read, however...
–More I/O, as the data moves from level to level
–No major compactions – dropping deletes is not trivial
–Messes up file ordering due to file boundary overlaps between levels – not readable correctly by the default store
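The "one file per level, plus all of L0" read pattern can be sketched as follows. The data layout (file name plus first and last row key per file) and the function name are assumptions for illustration, not LevelDB's or HBase's actual structures.

```python
import bisect


def files_for_key(key, l0_files, levels):
    """Return the names of the files a point read for `key` must check.

    `l0_files` is a list of file names; L0 ranges may overlap, so all are read.
    `levels` is a list of levels; each level is a list of
    (name, first_key, last_key) tuples sorted by first_key, non-overlapping.
    """
    hits = list(l0_files)
    for level in levels:
        starts = [first for _, first, _ in level]
        # Last file whose range starts at or before the key.
        i = bisect.bisect_right(starts, key) - 1
        if i >= 0 and key <= level[i][2]:
            hits.append(level[i][0])
    return hits
```

Because ranges within a level do not overlap, a binary search finds the single candidate file per level, which is why the number of files per read stays small even when many files exist.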
Stripe compactions (HBASE-7667)
• Somewhat like LevelDB, partition the keys inside each region/store
• But, only 1 level (plus optional L0)
• Compared to regions, partitioning is more flexible
–The default is a number of ~equal-sized stripes
• To read, just read relevant stripes + L0, if present
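[Figure: a region spanning row keys ccc (start) to iii (end) along the row-key axis, split into stripes at eee and ggg, each holding several HFiles, plus one L0 HFile; a get for 'hbase' touches the stripe covering that key, plus L0]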
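A sketch of the stripe read path, using a region from ccc to iii with stripe boundaries at eee and ggg. Treating a boundary key as the inclusive start of the next stripe is an assumption made here, and the function names are hypothetical.

```python
import bisect


def stripe_index(boundaries, row_key):
    """Index of the stripe holding `row_key`.

    `boundaries` are the sorted keys where one stripe ends and the next
    begins; N boundaries give N + 1 stripes inside the region.
    """
    return bisect.bisect_right(boundaries, row_key)


def files_to_read(stripes, l0_files, boundaries, row_key):
    """Files for a point read: the matching stripe's files plus all of L0."""
    return stripes[stripe_index(boundaries, row_key)] + l0_files
```

With boundaries `["eee", "ggg"]`, a get for `'hbase'` falls into the last stripe, so only that stripe's files and L0 are consulted.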
Stripe compactions – writes
•Data flushed from MemStore into several files
•Each stripe compacts separately most of the time
[Figure: a MemStore flush writes several small HFiles to HDFS, one into each stripe alongside the existing HFiles]
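Flushing into several files can be pictured as cutting the sorted MemStore contents at the stripe boundaries. This is a sketch under assumptions: rows are (key, value) pairs already sorted by key, and the function name and the returned dict of stripe index to rows are illustrative only.

```python
import bisect


def split_flush(sorted_rows, boundaries):
    """Split rowkey-sorted (key, value) pairs into one list per stripe.

    Returns {stripe_index: rows}; stripes that receive no rows get no file.
    """
    files = {}
    for key, value in sorted_rows:
        stripe = bisect.bisect_right(boundaries, key)
        files.setdefault(stripe, []).append((key, value))
    return files
```

Each resulting file lies entirely inside one stripe, which is what lets the stripes compact independently afterwards.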
Stripe compactions – other
•Why L0?
–Bulk loaded files go to L0
–Flushes can also go into single L0 files (to avoid tiny files)
–Several L0 files are then compacted into striped files
•Can drop deletes if compacting one entire stripe +L0
–No need for major compactions, ever
•Compact 2 stripes together – rebalance if unbalanced
–Very rare, however - unbalanced stripes are not a huge deal
• Boundaries could be used to improve region splits in future
Stripe compactions - performance
•EC2, c1.xlarge, preload; then measure random read perf
–LoadTestTool + deletes + overwrites; measure random reads
[Figure: random gets per second (0-2000) vs. test time (sec), 2500-8500; 30-second moving averages for default and stripe compactions]
Stripe compactions - performance
• On individual request level: median latency is the same (1.6 ms)
• However, 90th percentile: 15% improvement (~13 ms to ~11 ms)
• 99th percentile: 20% improvement (~60 ms to ~47 ms)
• While also sending ~18% more reads in ~4% less time
[Figure: latency CDF (0 to 1) over 0-20 ms, Default vs. Stripes (12)]
Other stripe boundary schemes
•For sharded sequential keys (like OpenTSDB), compacting old data again and again is not useful
•What if stripes split dynamically as they grow?
–If data is sequential, only a subset of stripes will grow
–Non-growing stripes never need to be compacted
[Figure: HFiles arranged in stripes along the rowkey space; one stripe is labeled "Too big!" and, after it splits, an older stripe is labeled "Now this will hardly ever compact"]
Others in development – tier-based
•Tier-based compaction selection (HBASE-7055; originally developed at Facebook)
–Old data may not be read as frequently, new data may all be in cache so # of files does not matter, etc.
–So, during selection, dynamically arrange files into tiers, and apply different rules (ratios, etc.) to them
•Simple example (only 2 tiers)
[Figure: two tiers of HFiles; one selection is labeled "Looks like a good selection…" and the other "However, if old files are rarely read, it's better to compact new first"]
Others in development, or considered
•Large Object store (HBASE-7949)
•Partition files based on versions, timestamp, etc.
•LevelDB compactions (HBASE-7519)
•…more to come?
Resources
•The compaction section of the HBase book contains a lot of details on tuning the default selection
–http://hbase.apache.org/book.html#compaction
–There are other knobs that may be poorly documented
•JIRAs to track the work done for compactions
–https://issues.apache.org/jira/browse/HBASE/component/12319905
•Design and configuration documentation for the new compactions is attached to the JIRAs
–Tier-based: HBASE-7055, stripe: HBASE-7667
–Book will be updated as things make it into trunk
Summary
•Compactions are a way to reduce the number of files to read when getting data
•Compactions are expensive, so efficiency is important
•HBase 0.96 compactions
–contain automatic improvements to the default algorithm
–are easier to improve, build upon, and configure
•Work in progress to improve compactions for Big Data
•Scenario-specific compaction algorithms are also possible, and are being worked on
Q & A

More Related Content

What's hot

Training Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed cachingTraining Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed caching
OutSystems
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
VARUN SAXENA
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
DataWorks Summit
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
enissoz
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
HBaseCon
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
Bryan Bende
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
Cloudera, Inc.
 
HBase Read High Availabilty using Timeline Consistent Region Replicas
HBase Read High Availabilty using Timeline Consistent Region ReplicasHBase Read High Availabilty using Timeline Consistent Region Replicas
HBase Read High Availabilty using Timeline Consistent Region ReplicasDataWorks Summit
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
Knoldus Inc.
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon
 
Oracle GoldenGate
Oracle GoldenGate Oracle GoldenGate
Oracle GoldenGate
oracleonthebrain
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
IBM Integration Bus High Availability Overview
IBM Integration Bus High Availability OverviewIBM Integration Bus High Availability Overview
IBM Integration Bus High Availability Overview
Peter Broadhurst
 
Always on in SQL Server 2012
Always on in SQL Server 2012Always on in SQL Server 2012
Always on in SQL Server 2012
Fadi Abdulwahab
 
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby NodeHadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Erik Krogen
 
Apache flink
Apache flinkApache flink
Apache flink
pranay kumar
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
Chicago Hadoop Users Group
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon
 
Managing your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache AmbariManaging your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache Ambari
DataWorks Summit
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
Karan Singh
 

What's hot (20)

Training Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed cachingTraining Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed caching
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
HBase Read High Availabilty using Timeline Consistent Region Replicas
HBase Read High Availabilty using Timeline Consistent Region ReplicasHBase Read High Availabilty using Timeline Consistent Region Replicas
HBase Read High Availabilty using Timeline Consistent Region Replicas
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
 
Oracle GoldenGate
Oracle GoldenGate Oracle GoldenGate
Oracle GoldenGate
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
IBM Integration Bus High Availability Overview
IBM Integration Bus High Availability OverviewIBM Integration Bus High Availability Overview
IBM Integration Bus High Availability Overview
 
Always on in SQL Server 2012
Always on in SQL Server 2012Always on in SQL Server 2012
Always on in SQL Server 2012
 
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby NodeHadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
 
Apache flink
Apache flinkApache flink
Apache flink
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
Managing your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache AmbariManaging your Hadoop Clusters with Apache Ambari
Managing your Hadoop Clusters with Apache Ambari
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 

Similar to HBaseCon 2013: Compaction Improvements in Apache HBase

HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
Cloudera, Inc.
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
DataWorks Summit
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
DataWorks Summit
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messagesfeng1212
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014
larsgeorge
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
HBaseCon
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
DataWorks Summit
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
HBaseCon
 
Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0
DataWorks Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
DataWorks Summit
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
Owen O'Malley
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
Nick Dimiduk
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale
Perforce
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
DataWorks Summit
 

Similar to HBaseCon 2013: Compaction Improvements in Apache HBase (20)

HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

HBaseCon 2013: Compaction Improvements in Apache HBase

  • 1. © Hortonworks Inc. 2011 Compaction Improvements in Apache HBase Sergey Shelukhin sergey@hortonworks.com
  • 2. About me
      • HBase committer since February 2013
      • Member of Technical Staff at Hortonworks
      • Twitter: @sershe84
  • 3. Overview
      • What are compactions?
      • Default algorithm and improvements
      • Enabling different implementations
      • Algorithms for various scenarios
      • Conclusions
  • 4. What are compactions?
  • 5. What are compactions?
      • HBase writes out immutable files as data is added
          – Each Store (CF + region) consists of these rowkey-ordered files
          – Immutable => more files accumulate over time
          – More files => slower reads
      • Compaction rewrites several files into one
          – Fewer files => faster reads
      • Major compaction rewrites all files in a Store into one
          – Can drop deleted records, tombstones and old versions
      • In minor compaction, files to compact are selected based on a heuristic
  • 6. Compactions example
      • Memstore fills up, files are flushed
      • When enough files accumulate, they are compacted
      • [Diagram: writes go to the MemStore, which is flushed to HFiles on HDFS]
  • 7. Reads slow down w/o compactions
      • If too many files accumulate, reads slow down
      • Read latency over time without compactions:
      • [Chart: read latency (ms, 0 to 25) vs. load test time (sec, 0 to 14400)]
  • 8. But compactions cause slowdowns
      • Looks like lots of I/O for no apparent benefit
      • Example effect on reads (note better average)
      • [Chart: read latency (ms, 0 to 25) vs. load test time (sec, 0 to 10800)]
  • 9. Default algorithm and improvements
  • 10. Compaction tradeoffs
      • HBase resolves key conflicts by file age
          – Therefore, can only compact contiguous files
      • Large compactions are more efficient (less total I/O)
          – However, they can cause long slowdowns for clients
      • Small compactions have less effect on clients
          – However, in total you do more rewriting
      • We want to compact similar files
  • 11. Default algorithm in 0.94
      • Ratio-based selection
          – Look for files at most F times larger than the following files
          – Also allows limiting file numbers and sizes
      • Higher ratio => more aggressive (default 1.2)
      • Example: 2 files minimum, 3 maximum, ratio 1.2
      • [Diagram: a row of HFiles with candidate selections labeled "Too big!", "Too many files!" and "OK."]
      • Usually good for typical accumulation of flushed files
      • Not good for bulk load: unpredictable file sizes!
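The ratio-based selection on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not the HBase implementation: the function name `select_ratio_based`, its parameters and the sample sizes are hypothetical, and it assumes that "the following files" means the next `max_files - 1` files in age order.

```python
def select_ratio_based(sizes, ratio=1.2, min_files=2, max_files=3):
    """Return (start, end) indices of the files to compact, or None.

    `sizes` lists file sizes from oldest to newest (assumed ordering).
    """
    start = 0
    # Skip older files that are too big relative to the files that follow them.
    while start < len(sizes):
        following = sizes[start + 1:start + max_files]
        if sizes[start] <= ratio * sum(following):
            break
        start += 1
    # Take the first valid selection, capped at max_files.
    end = min(start + max_files, len(sizes))
    if end - start < min_files:
        return None
    return start, end


print(select_ratio_based([100, 30, 12, 10, 9]))  # (2, 5)
```

Note that only the first file of the selection is checked against the ratio, which is the weakness the later slides discuss.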
  • 12. Off-peak compactions
      • Good if you have variable load through the day
      • HBASE-4463, present in 0.94 (since 2011)
      • Compact more aggressively during certain hours of the day, when load is lower
      • Set off-peak period via
          – hbase.offpeak.start.hour, hbase.offpeak.end.hour (0-23)
      • Then, set ratio via
          – hbase.hstore.compaction.ratio.offpeak (default is 5)
      • Only one "off-peak" compaction at a time, so load is not totally prohibitive
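How the off-peak window might feed into the choice of ratio can be sketched as below. The helper names are hypothetical, and two details are assumptions rather than facts from the slide: the start hour is treated as inclusive and the end hour as exclusive, and a window such as 22 to 6 is allowed to wrap past midnight. The ratios 1.2 and 5 are the defaults quoted in the slides.

```python
def is_off_peak(hour, start_hour, end_hour):
    """True if `hour` (0-23) falls inside the off-peak window."""
    if start_hour == end_hour:
        return False  # empty window: never off-peak (assumption)
    if start_hour < end_hour:
        return start_hour <= hour < end_hour
    # Window wraps around midnight, e.g. 22 -> 6.
    return hour >= start_hour or hour < end_hour


def compaction_ratio(hour, start_hour, end_hour, normal=1.2, off_peak=5.0):
    """Pick the more aggressive ratio during off-peak hours."""
    return off_peak if is_off_peak(hour, start_hour, end_hour) else normal
```

With a 22 to 6 window, `compaction_ratio(23, 22, 6)` uses the off-peak ratio and `compaction_ratio(12, 22, 6)` the normal one.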
  • 13. Inefficiencies in default algorithm
      • First valid selection is chosen
      • Ratio is only considered for the first selected file
          – Thus, other files in compaction may not be similar
      • The solution found may not be the best one
          – Especially for bulk load, with unpredictable file sizes
      • [Diagram: a row of HFiles with a selection labeled "Matches the ratio, but this is a bad selection"]
  • 14. Exploring compaction selection
      • There are usually not so many files, so looking at all valid permutations and comparing quality is viable
      • HBASE-7842: "exploring" compaction selection
          – Ratio checked for each file to choose good permutations
          – When store is OK, try to compact the most files
          – When store has too many files, try to eliminate some as fast as possible
      • On by default in 0.95/0.96
      • Works with your old configuration settings
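The "exploring" idea can be illustrated with a short sketch. It is an assumption-laden simplification rather than the HBASE-7842 code: the function names and sample sizes are hypothetical, "in ratio" is taken to mean that each file is at most `ratio` times the sum of the other files in the selection, ties are broken by smaller total size, and the "too many files" mode is left out.

```python
def in_ratio(window, ratio):
    """True if every file is at most `ratio` times the sum of the others."""
    total = sum(window)
    return all(size <= ratio * (total - size) for size in window)


def _better(candidate, current):
    # Prefer more files; on a tie, prefer less data to rewrite (assumption).
    if len(candidate) != len(current):
        return len(candidate) > len(current)
    return sum(candidate) < sum(current)


def select_exploring(sizes, ratio=1.2, min_files=3, max_files=10):
    """Return (start, end) of the best contiguous selection, or None."""
    best = None
    for start in range(len(sizes)):
        last_end = min(start + max_files, len(sizes))
        for end in range(start + min_files, last_end + 1):
            window = sizes[start:end]
            if not in_ratio(window, ratio):
                continue
            if best is None or _better(window, sizes[best[0]:best[1]]):
                best = (start, end)
    return best


print(select_exploring([100, 12, 10, 11, 9]))  # (1, 5)
```

Only contiguous windows are examined, because files can only be compacted with their neighbors in age order.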
  • 15. Examples and results
      • In previous example
      • [Diagram: a row of HFiles with candidate selections labeled "Not in ratio, dissimilar files", "In ratio, may be valid…" and "But this has more files!"]
      • On bulk loads of random size, depending on settings:
          – Loses only 0-10% efficiency in reducing file count
          – While reducing I/O 3-10 times
      • Best results with ratio 1.3-1.4, 4 minimum files
  • 16. Enabling different implementations
  • 17. Making compactions pluggable
      • To allow further improvements, the code should be easy to replace; not the case as of 0.94
      • Initial implementation (part of HBASE-7055, HBASE-7516): make just the selection pluggable
      • This is called "policy" (CompactionPolicy)
      • Example usages
          – Exploring selection, mentioned previously
          – Tier-based selection (port from Facebook)
  • 18. Making compactions more pluggable
      • Other potential improvements are more involved
      • Need to change other things (HBASE-7678)
      • The meta-structure of the files (StoreFileManager, HBASE-7603)
          – Group files by some key/time/… based scheme
          – In memory/metadata only; filesystem structure or file format changes would be a compatibility nightmare
          – Example: LevelDB-style compactions, stripes
      • Compactor to compact the files (Compactor)
          – Example: large object store, levels, stripes
      • Can replace parts together or separately (StoreEngine)
          – E.g. level compactor only makes sense with level-aware store
  • 19. Enabling compaction tuning
      • Different tables (or even column families) have different data and access patterns
      • Compactions already have a large number of knobs
      • Starting with 0.96, they can be configured on table/CF level (HBASE-7236)
      • Example from the shell:
          alter 'table1', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', ... }
  • 20. Algorithms for various scenarios
  • 21. Key ways to improve compactions
      • Read from fewer files
          – Separate files by row key, version, time, etc.
          – Allows large number of files to be present, uncompacted
      • Don't compact the data you don't need to compact
          – For example, old data in OpenTSDB-like systems
          – Obviously, results in less I/O
      • Make compactions smaller
          – Without too much I/O amplification or too many files
          – Results in fewer compaction-related outages
      • HBase works better with few large regions; however, large compactions cause unavailability
  • 22. How to avoid large compactions
      • LevelDB compactions
          – Files live on multiple levels
          – Files on each level have non-overlapping row-key ranges
          – …except level 0 (L0), where memstore flushes go
          – Compact overlapping subsets of 2 levels, data goes up a level
          – Most read requests need only one file per level, plus all of L0
      • Small compactions, few files per read, however...
          – More I/O, as the data moves from level to level
          – No major compactions; dropping deletes is not trivial
          – Messes up file ordering due to file boundary overlaps between levels; not readable correctly by default store
  • 23. Stripe compactions (HBASE-7667)
      • Somewhat like LevelDB, partition the keys inside each region/store
      • But, only 1 level (plus optional L0)
      • Compared to regions, partitioning is more flexible
          – The default is a number of ~equal-sized stripes
      • To read, just read relevant stripes + L0, if present
      • [Diagram: row-key axis from region start key ccc to region end key iii, with stripe boundaries at eee and ggg; HFiles in each stripe plus L0; example request: get 'hbase']
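The read path on this slide (read the relevant stripe plus L0) can be sketched with a binary search over the stripe boundaries. This is an illustration only: `files_for_get` and the file names are hypothetical, and the sketch assumes each stripe includes its start key and excludes its end key. The boundaries eee and ggg are taken from the slide's diagram.

```python
from bisect import bisect_right


def files_for_get(row_key, boundaries, stripe_files, l0_files):
    """Return the files a single-row read has to consult.

    `boundaries` holds the sorted interior stripe boundaries, so
    len(stripe_files) == len(boundaries) + 1.
    """
    stripe_index = bisect_right(boundaries, row_key)
    return stripe_files[stripe_index] + l0_files


boundaries = ["eee", "ggg"]  # stripes: [ccc, eee), [eee, ggg), [ggg, iii)
stripe_files = [["stripe1-a"], ["stripe2-a", "stripe2-b"], ["stripe3-a"]]
l0_files = ["L0-a"]

print(files_for_get("hbase", boundaries, stripe_files, l0_files))
# ['stripe3-a', 'L0-a']
```

The row key 'hbase' sorts after ggg, so only the last stripe and L0 are read, no matter how many files the other stripes hold.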
  • 24. Stripe compactions – writes
      • Data flushed from MemStore into several files
      • Each stripe compacts separately most of the time
      • [Diagram: MemStore flushing into several HFiles on HDFS]
  • 25. Stripe compactions – other
      • Why L0?
          – Bulk loaded files go to L0
          – Flushes can also go into single L0 files (to avoid tiny files)
          – Several L0 files are then compacted into striped files
      • Can drop deletes if compacting one entire stripe + L0
          – No need for major compactions, ever
      • Compact 2 stripes together: rebalance if unbalanced
          – Very rare, however; unbalanced stripes are not a huge deal
      • Boundaries could be used to improve region splits in future
  • 26. Stripe compactions - performance
      • EC2, c1.xlarge, preload; then measure random read perf
          – LoadTestTool + deletes + overwrites; measure random reads
      • [Chart: random gets per second (0 to 2500) vs. test time (sec, 3500 to 8500); series: "Default gets-per-second, 30sec. MA" and "Stripe gets-per-second, 30sec. MA"]
  • 27. Stripe compactions - performance
      • On individual request level: median latency is the same (1.6 ms)
      • However, 90th percentile: 15% improvement (~13 ms to ~11 ms)
      • 99th percentile: 20% improvement (~60 ms to ~47 ms)
      • While also sending ~18% more reads in ~4% less time
      • [Chart: CDF (0 to 1) of latency (ms, 0 to 20); series: "Default" and "Stripes (12)"]
  • 28. Other stripe boundary schemes
      • For sharded sequential keys (like OpenTSDB), compacting old data again and again is not useful
      • What if stripes split dynamically as they grow?
          – If data is sequential, only a subset of stripes will grow
          – Non-growing stripes never need to be compacted
      • [Diagram: HFiles along the rowkey space, with labels "Too big!" and "Now this will hardly ever compact"]
  • 29. Others in development – tier-based
      • Tier-based compaction selection (HBASE-7055; originally developed at Facebook)
          – Old data may not be read as frequently, new data may all be in cache so # of files does not matter, etc.
          – So, during selection, dynamically arrange files into tiers, and apply different rules (ratios, etc.) to them
      • Simple example (only 2 tiers)
      • [Diagram: rows of HFiles with labels "Looks like a good selection…" and "However, if old files are rarely read, it's better to compact new first"]
  • 30. Others in development, or considered
      • Large Object store (HBASE-7949)
      • Partition files based on versions, timestamp, etc.
      • LevelDB compactions (HBASE-7519)
      • …more to come?
  • 31. Resources
      • HBase book section contains a lot of details on tuning the default selection
          – http://hbase.apache.org/book.html#compaction
          – There are other knobs that may be poorly documented
      • JIRAs to track the work done for compactions
          – https://issues.apache.org/jira/browse/HBASE/component/12319905
      • Design and configuration documentation for the new compactions is attached to JIRAs
          – Tier-based: HBASE-7055, stripe: HBASE-7667
          – Book will be updated as things make it into trunk
  • 32. Summary
      • Compactions are a way to reduce the number of files to read when getting data
      • Compactions are expensive, so efficiency is important
      • HBase 0.96 compactions
          – Contain automatic improvements to default algo
          – Are easier to improve, build upon, and configure
      • Work in progress to improve compactions for Big Data
      • Scenario-specific compaction algorithms are also possible, and being worked on
  • 33. Q & A

Editor's Notes

  1. Example of CF delete processing