Hive Integration: HBase and RCFile (Hadoop Summit 2010)


Hadoop Summit 2010 - Developers Track
Hive Integration: HBase and RCFile
John Sichi and Yongqiang He, Facebook

Published in: Technology

  1. Hive Integration: HBase and RCFile
       • John Sichi and Yongqiang He, Facebook
  2. Session Agenda
       • HBase Integration (John Sichi)
       • RCFile Integration (Yongqiang He)
  3. HBase: Facebook Warehouse Use Case
       • Reduce latency on dimension data availability
       • [Diagram labels: HBase (dimension data); partitioned RCFiles (fact data); periodic load; continuous update; Hive queries]
  4. HBase: Storage Handler
         CREATE TABLE users(
           userid int, name string, email string, notes string)
         STORED BY
           'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
         WITH SERDEPROPERTIES (
           "hbase.columns.mapping" =
           "small:name,small:email,large:notes")
         TBLPROPERTIES (
           "" = "user_list"
         );
       • INSERT, SELECT, JOIN, GROUP BY, UNION, etc.
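To make the column mapping on this slide concrete, here is a minimal Python sketch that pairs the Hive columns with the `family:qualifier` entries from `hbase.columns.mapping`. It is an illustration only, not the storage handler's code. The helper name `map_hive_columns` is hypothetical, and the rule that the first Hive column becomes the HBase row key when the mapping has one entry fewer than the table has columns is an assumption inferred from the slide's four columns and three mapping entries.

```python
def map_hive_columns(hive_columns, columns_mapping):
    """Pair Hive columns with HBase 'family:qualifier' entries.

    Assumption (not stated on the slide): when the mapping has one entry
    fewer than the table has columns, the first Hive column is the row key.
    """
    entries = [entry.strip() for entry in columns_mapping.split(",")]
    if len(entries) == len(hive_columns) - 1:
        row_key, value_columns = hive_columns[0], hive_columns[1:]
    elif len(entries) == len(hive_columns):
        row_key, value_columns = None, hive_columns
    else:
        raise ValueError("mapping does not line up with the Hive columns")

    mapping = {}
    for column, entry in zip(value_columns, entries):
        # Split 'small:name' into the column family and the qualifier.
        family, _, qualifier = entry.partition(":")
        mapping[column] = (family, qualifier)
    return row_key, mapping


row_key, mapping = map_hive_columns(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
```

Under that assumption, `userid` is the row key, `name` and `email` land in the `small` column family, and `notes` lands in `large`.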
  5. HBase: Integration Status
       • Testing at scale
           • 20-node test cluster
           • Bulk-loaded 6TB of gzip-compressed data from Hive into HBase in about 30 hours
           • Incrementally loaded from Hive into HBase at 30GB/hr (with write-ahead logging disabled)
           • Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)
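The two load figures on this slide can be put on the same scale with a little arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB). The slide does not say whether the 30GB/hr figure refers to compressed data, so the ratio is only a rough comparison.

```python
GB_PER_TB = 1000  # assumption: decimal units

bulk_tb = 6                   # gzip-compressed data bulk-loaded
bulk_hours = 30               # "about 30 hours" on the slide
incremental_gb_per_hour = 30  # incremental load rate from the slide

# Average bulk-load throughput in GB per hour.
bulk_gb_per_hour = bulk_tb * GB_PER_TB / bulk_hours

# How many times faster the bulk path was than the incremental path.
ratio = bulk_gb_per_hour / incremental_gb_per_hour

print(f"bulk load: {bulk_gb_per_hour:.0f} GB/hr, "
      f"about {ratio:.1f}x the incremental rate")
```

Under these assumptions the bulk load averaged roughly 200 GB/hr, a bit under seven times the incremental rate.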
  6. HBase: Integration Roadmap
       • Retest against HBase trunk with larger (30TB) data
       • Try out new features for accelerating incremental load
           • Bulk load into table with existing data
           • Multiputs
           • Deferred logging
       • Support for "virtual partitions" based on timestamps
       • Support for deletion
       • Push down filters
       • Index join? Optimize scans?
  7. RCFile
       • Why columnar storage
           • Better compression
               • Lightweight compression
               • RLE
               • Bitmap
               • Etc.
           • CPU, memory, storage
           • Columnar operators
               • Cache-conscious (MonetDB)
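As an illustration of why lightweight schemes such as RLE suit columnar layouts, here is a minimal run-length encoding sketch in Python. The function names and the sample column are hypothetical; this is not RCFile's on-disk encoding, only the general technique the slide names.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# A column stored contiguously tends to contain long runs of repeated values.
column = ["US", "US", "US", "BR", "BR", "US"]
encoded = rle_encode(column)
decoded = rle_decode(encoded)
```

Values from a single column sit next to each other, so repeated values form runs that a row-oriented layout would interleave with other fields.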
  8. RCFile
       • Why RCFile
           • Huge data
           • Reduce the storage space required
           • Ad hoc workloads
           • Storage space vs. speed (data performance)
           • Can we get both with no application changes?
               • Reduce storage space
               • Accelerate performance for arbitrary applications
  9. RCFile
       • Pros
           • Works with column pruning
               • Only touch needed columns at runtime
               • Lazy decompression
                   • SELECT col1, col2 FROM tbl_col_10 WHERE col1 > 30
                   • Will only touch col1 and col2
                   • col2 is decompressed only when a block contains a col1 value greater than 30
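The lazy decompression behavior described on this slide can be sketched in a few lines of Python. Everything here is hypothetical scaffolding (the `LazyColumn` class, the `scan` function, the use of `zlib`, and the comma-separated encoding). It is not RCFile's reader, only a model of the idea: evaluate the predicate on `col1` first and decompress `col2` only for blocks that contain a match.

```python
import zlib


class LazyColumn:
    """One column of one block, kept compressed until first accessed."""

    def __init__(self, values):
        self._compressed = zlib.compress(",".join(map(str, values)).encode())
        self._values = None
        self.decompress_count = 0  # bookkeeping so the laziness is observable

    def values(self):
        if self._values is None:
            self.decompress_count += 1
            text = zlib.decompress(self._compressed).decode()
            self._values = [int(x) for x in text.split(",")]
        return self._values


def scan(blocks):
    """Model of: SELECT col1, col2 FROM tbl WHERE col1 > 30."""
    rows = []
    for block in blocks:
        col1 = block["col1"].values()  # the predicate column is always read
        hits = [i for i, value in enumerate(col1) if value > 30]
        if not hits:
            continue  # no match in this block: col2 stays compressed
        col2 = block["col2"].values()
        rows.extend((col1[i], col2[i]) for i in hits)
    return rows  # col3 is never touched (column pruning)


blocks = [
    {"col1": LazyColumn([1, 5, 9]), "col2": LazyColumn([10, 20, 30]),
     "col3": LazyColumn([0, 0, 0])},
    {"col1": LazyColumn([25, 40, 31]), "col2": LazyColumn([7, 8, 9]),
     "col3": LazyColumn([0, 0, 0])},
]
rows = scan(blocks)
```

In this toy run only the second block has a `col1` value above 30, so `col2` is decompressed once and `col3` not at all.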
  10. RCFile
       • Cons
           • Row construction
               • Is the main overhead
               • Each column's data is stored separately and may be sorted in a different order
               • For RCFile, this is an in-memory operation
               • This could be really painful; there is a lot of room to improve here
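A rough Python sketch of the row construction step is shown below. It assumes the simple case where the columns of one block are already aligned by position, and the `build_rows` helper and sample data are hypothetical. The point is only that every output row requires gathering one value from each separately stored column.

```python
def build_rows(columns, names):
    """Stitch separately stored column arrays back into row dictionaries.

    Assumption: the i-th value of every column belongs to the i-th row.
    """
    selected = [columns[name] for name in names]
    return [dict(zip(names, values)) for values in zip(*selected)]


# One block's worth of data, stored column by column.
columns = {
    "userid": [1, 2, 3],
    "name": ["ann", "bob", "cy"],
    "email": ["a@example.org", "b@example.org", "c@example.org"],
}
rows = build_rows(columns, ["userid", "name"])
```

The work grows with the number of rows times the number of selected columns, which is one way to see why this step dominates when many columns are read.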
  11. RCFile
       • Facebook deployment
           • Default file format in the Facebook cluster
           • 20% space savings on average
           • We are transforming old data to the new format
  12. RCFile
       • Future work
           • Support built-in indexing
               • E.g., Bloom filters
           • More cache-conscious columnar operators
           • Push predicates down to the file reader
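To show how a Bloom filter could serve as a built-in index, here is a minimal Python sketch of a per-block filter that lets a reader skip blocks that cannot contain a value. The class, its parameters (`num_bits`, `num_hashes`), and the `blocks_to_read` helper are all hypothetical. The slide lists this as future work and does not describe a design.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def blocks_to_read(block_filters, value):
    """Indexes of blocks whose filter cannot rule out the value."""
    return [i for i, bf in enumerate(block_filters) if bf.might_contain(value)]


# Build one filter per block from that block's column values.
block_values = [[1, 2, 3], [40, 41, 42]]
block_filters = []
for values in block_values:
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    block_filters.append(bf)

candidates = blocks_to_read(block_filters, 41)
```

A block that holds the value is always among the candidates. Other blocks are usually skipped, with occasional false positives costing only an unnecessary read.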
  13. Questions?
       • [email_address]
       • [email_address]