Hive integration: HBase and Rcfile__HadoopSummit2010


Published on

Hadoop Summit 2010 - Developers Track
Hive integration: HBase and Rcfile
John Sichi and Yongqiang He, Facebook

Published in: Technology

    1. Hive Integration: HBase and RCFile
       • John Sichi and Yongqiang He
       • Facebook
    2. Session Agenda
       • HBase Integration (John Sichi)
       • RCFile Integration (Yongqiang He)
    3. HBase: Facebook Warehouse Use Case
       • Reduce latency on dimension data availability
       • [Diagram labels: HBase (dimension data); partitioned RCFiles (fact data); periodic load; continuous update; Hive queries]
    4. HBase: Storage Handler
         CREATE TABLE users(
           userid int, name string, email string, notes string)
         STORED BY
           'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
         WITH SERDEPROPERTIES (
           "hbase.columns.mapping" =
           "small:name,small:email,large:notes")
         TBLPROPERTIES (
           "hbase.table.name" = "user_list"
         );
       • INSERT, SELECT, JOIN, GROUP BY, UNION, etc.
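The mapping string on the storage handler slide can be read as a list of `family:qualifier` entries, one per Hive column. The following is only a rough sketch of that pairing, not the storage handler's actual code. It assumes (the slide does not say so) that the first Hive column, `userid`, serves as the HBase row key and therefore has no entry in the mapping; `pair_columns` is a hypothetical helper name.

```python
# Sketch only: pairs Hive columns with HBase columns as the slide's mapping
# string suggests. ASSUMPTION: the first Hive column is the HBase row key
# and so has no entry in "hbase.columns.mapping".

def pair_columns(hive_columns, columns_mapping):
    """Return {hive_column: (family, qualifier)}; the row key maps to None."""
    entries = [entry.strip() for entry in columns_mapping.split(",")]
    key_column, value_columns = hive_columns[0], hive_columns[1:]
    if len(entries) != len(value_columns):
        raise ValueError("mapping needs one entry per non-key column")
    pairs = {key_column: None}
    for column, entry in zip(value_columns, entries):
        family, qualifier = entry.split(":", 1)
        pairs[column] = (family, qualifier)
    return pairs


mapping = pair_columns(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
```

Under that assumption, `name` and `email` land in the `small` column family and `notes` in the `large` one, which is presumably why the slide separates them: short, frequently read fields stay apart from bulky text.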
    5. HBase: Integration Status
       • Testing at scale
           • 20-node test cluster
           • Bulk-loaded 6TB of gzip-compressed data from Hive into HBase in about 30 hours
           • Incrementally loaded from Hive into HBase at 30GB/hr (with write-ahead logging disabled)
           • Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)
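To put the two load figures on the same scale, the arithmetic can be worked through as below. This is a back-of-the-envelope comparison only: it assumes decimal units (1 TB = 1000 GB), takes "about 30 hours" as exactly 30, and ignores that the bulk-load figure is for gzip-compressed data while the slide does not say how the incremental figure was measured.

```python
# Rough comparison of the two load rates quoted on the slide.
# ASSUMPTIONS: 1 TB = 1000 GB, and "about 30 hours" treated as exactly 30.
TB_IN_GB = 1000

bulk_gb_per_hour = 6 * TB_IN_GB / 30          # bulk load: 6 TB in 30 hours
incremental_gb_per_hour = 30                  # incremental load rate from the slide

# How many times faster the bulk path is, and how long the same 6 TB
# would take at the incremental rate.
speedup = bulk_gb_per_hour / incremental_gb_per_hour
hours_for_6tb_incremental = 6 * TB_IN_GB / incremental_gb_per_hour

print(f"bulk: {bulk_gb_per_hour:.0f} GB/hr, about {speedup:.1f}x the incremental rate")
print(f"6 TB at the incremental rate: {hours_for_6tb_incremental:.0f} hours")
```

On these assumptions the bulk path moves roughly 200 GB/hr, several times the incremental rate, which fits the roadmap's interest in faster incremental loading.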
    6. HBase: Integration Roadmap
       • Retest against HBase trunk with larger (30TB) data
       • Try out new features for accelerating incremental load
           • Bulk load into table with existing data
           • Multiputs
           • Deferred logging
       • Support for "virtual partitions" based on timestamps
       • Support for deletion
       • Push down filters
       • Index join? Optimize scans?
    7. RCFile
       • Why Columnar Storage
           • Better compression
               • Lightweight compression
               • RLE
               • Bitmap
               • Etc.
           • CPU, memory, storage
           • Columnar operators
               • Cache-conscious (MonetDB)
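The slide names run-length encoding (RLE) and bitmap encoding as examples of lightweight compression that columnar layouts make practical, since values from one column tend to repeat. The sketch below illustrates the two general techniques on a toy column. It is not a description of RCFile's on-disk encoding, and the function names are made up for illustration.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def bitmap_encode(values):
    """Build one 0/1 list per distinct value, marking the rows that hold it."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps.setdefault(value, [0] * len(values))[row] = 1
    return bitmaps


# A toy low-cardinality column, as might appear in one column of a table.
column = ["US", "US", "US", "IN", "IN", "US"]
encoded = rle_encode(column)
bitmaps = bitmap_encode(column)
```

Both encodings pay off only when a column has few distinct values or long runs; in a row-oriented layout, neighboring bytes belong to different columns, so such runs rarely appear.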
    8. RCFile
       • Why RCFile
           • Huge data
           • Reduce data storage space required
           • Ad-hoc workloads
           • Storage space vs. speed (data performance)
           • Can we get both with no application changes?
               • Reduce storage space
               • Accelerate performance for arbitrary applications
    9. RCFile
       • Pros
           • Works with column pruning
               • Only touches needed columns at runtime
               • Lazy decompression
                   • SELECT col1, col2 FROM tbl_col_10 WHERE col1 > 30
                   • Will only touch col1 and col2
                   • col2 is decompressed only when a block contains a col1 value greater than 30
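The lazy decompression behavior described on the slide can be sketched as follows. This is a toy model, not RCFile's reader: it assumes each block (row group) stores every column as a separately compressed blob, uses `zlib` and JSON purely for convenience, and all names (`compress_column`, `scan_row_group`, `stats`) are hypothetical.

```python
import json
import zlib


def compress_column(values):
    """Toy stand-in for a per-column compressed chunk inside one row group."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


def decompress_column(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def scan_row_group(row_group, threshold, stats):
    """Evaluate SELECT col1, col2 ... WHERE col1 > threshold on one row group."""
    # The predicate column must always be decompressed.
    col1 = decompress_column(row_group["col1"])
    stats["col1_decompressed"] += 1
    hits = [row for row, value in enumerate(col1) if value > threshold]
    if not hits:
        return []  # no match in this row group: col2 stays compressed
    # Only now pay for col2. col3 is never touched (column pruning).
    col2 = decompress_column(row_group["col2"])
    stats["col2_decompressed"] += 1
    return [(col1[row], col2[row]) for row in hits]


row_groups = [
    {
        "col1": compress_column([5, 12, 7]),
        "col2": compress_column(["a", "b", "c"]),
        "col3": compress_column([1.0, 2.0, 3.0]),
    },
    {
        "col1": compress_column([31, 2, 40]),
        "col2": compress_column(["d", "e", "f"]),
        "col3": compress_column([4.0, 5.0, 6.0]),
    },
]

stats = {"col1_decompressed": 0, "col2_decompressed": 0}
rows = [row for group in row_groups for row in scan_row_group(group, 30, stats)]
```

In this toy run the first row group has no `col1` value above 30, so its `col2` blob is never decompressed; the saving grows with the selectivity of the predicate and the cost of decompressing the remaining columns.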
    10. RCFile
        • Cons
            • Row construction
                • Is the main overhead
                • Each column's data is stored separately, and may be sorted in a different order
                • In-memory operation for RCFile
                • This could be really painful; a lot of room to improve here
    11. RCFile
        • Facebook Deployment
            • Default file format in Facebook cluster
            • 20% space savings on average
            • We are transforming old data to the new format
    12. RCFile
        • Future work
            • Support built-in indexing
                • Such as Bloom filters, etc.
            • More cache-conscious columnar operators
            • Push predicates down to the file reader
    13. Questions?
        • [email_address]
        • [email_address]
