Hive integration: HBase and Rcfile__HadoopSummit2010

Hadoop Summit 2010 - Developers Track
Hive integration: HBase and Rcfile
John Sichi and Yongqiang He, Facebook

Published in: Technology
  • Transcript

    • 1. Hive Integration: HBase and RCFile
        John Sichi and Yongqiang He, Facebook
    • 2. Session Agenda
        • HBase Integration (John Sichi)
        • RCFile Integration (Yongqiang He)
    • 3. HBase: Facebook Warehouse Use Case
        • Reduce latency on dimension data availability
        [Diagram labels: HBase (dimension data); partitioned RCFiles (fact data); periodic load; continuous update; Hive queries]
    • 4. HBase: Storage Handler
        CREATE TABLE users(
          userid int, name string, email string, notes string)
        STORED BY
          'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES (
          "hbase.columns.mapping" =
          "small:name,small:email,large:notes")
        TBLPROPERTIES (
          "hbase.table.name" = "user_list"
        );
        • INSERT, SELECT, JOIN, GROUP BY, UNION, etc.
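To make the column mapping on this slide concrete, here is a minimal Python sketch of how the mapping string could be paired with the Hive columns. It is an illustration only, not the storage handler's actual code. The function name `parse_columns_mapping` is hypothetical, and the sketch assumes (as the slide's three-entry mapping for four Hive columns suggests) that the first Hive column, `userid`, serves as the HBase row key and has no mapping entry of its own.

```python
def parse_columns_mapping(hive_columns, mapping):
    """Pair each non-key Hive column with an HBase (family, qualifier)."""
    entries = [entry.strip() for entry in mapping.split(",")]
    # Assumption: the first Hive column is the HBase row key and is not
    # listed in the mapping string.
    key_column, value_columns = hive_columns[0], hive_columns[1:]
    if len(entries) != len(value_columns):
        raise ValueError("mapping must have one entry per non-key column")
    column_map = {}
    for column, entry in zip(value_columns, entries):
        family, _, qualifier = entry.partition(":")
        column_map[column] = (family, qualifier)
    return key_column, column_map


key, cols = parse_columns_mapping(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
for column, (family, qualifier) in cols.items():
    print(f"{column} -> family={family}, qualifier={qualifier}")
```

In this reading, `name` and `email` share the `small` column family while the potentially bulky `notes` column sits in its own `large` family.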
    • 5. HBase: Integration Status
        • Testing at scale
            • 20-node test cluster
            • Bulk-loaded 6TB of gzip-compressed data from Hive into HBase in about 30 hours
            • Incremental-loaded from Hive into HBase at 30GB/hr (with write-ahead logging disabled)
            • Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)
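The two load figures on this slide are easier to compare once they are in the same unit. The arithmetic below is a rough back-of-the-envelope sketch: it assumes 1 TB = 1000 GB, and it treats both rates as comparable even though the slide does not say whether the 30GB/hr incremental figure refers to compressed data.

```python
# Figures quoted on the slide.
bulk_data_tb = 6               # gzip-compressed data bulk-loaded
bulk_hours = 30                # "about 30 hours"
incremental_gb_per_hour = 30   # incremental load rate
nodes = 20                     # test cluster size

# Assumption: 1 TB = 1000 GB (the slide does not state a unit convention).
bulk_gb_per_hour = bulk_data_tb * 1000 / bulk_hours
bulk_gb_per_hour_per_node = bulk_gb_per_hour / nodes
speedup = bulk_gb_per_hour / incremental_gb_per_hour

print(f"bulk load: {bulk_gb_per_hour:.0f} GB/hr "
      f"({bulk_gb_per_hour_per_node:.0f} GB/hr per node)")
print(f"bulk vs. incremental: about {speedup:.1f}x")
```

Under those assumptions the bulk path moves roughly 200 GB/hr across the cluster, several times the incremental rate.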
    • 6. HBase: Integration Roadmap
        • Retest against HBase trunk with larger (30TB) data
        • Try out new features for accelerating incremental load
            • Bulk load into table with existing data
            • Multiputs
            • Deferred logging
        • Support for "virtual partitions" based on timestamps
        • Support for deletion
        • Push down filters
        • Index join? Optimize scans?
    • 7. RCFile
        • Why Columnar Storage
            • Better Compression
                • Lightweight compression
                • RLE
                • Bitmap
                • Etc.
            • CPU, Memory, Storage
            • Columnar Operator
                • Cache conscious (MonetDB)
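As an illustration of the lightweight compression mentioned on this slide, here is a minimal run-length encoding (RLE) sketch in Python. It is a generic textbook version, not RCFile's codec. The sample `country` column is made up for the example. The point is that storing a column's values together tends to produce long runs of repeated values, which RLE collapses cheaply.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


# Hypothetical column values, stored contiguously as in a columnar layout.
country = ["US", "US", "US", "IN", "IN", "BR"]
runs = rle_encode(country)
print(runs)
assert rle_decode(runs) == country
```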
    • 8. RCFile
        • Why RCFile
            • Huge Data
            • Reduce data storage space required
            • Ad-hoc workloads
            • Storage space vs. speed (data performance)
            • Can we get both with no application changes?
                • Reduce storage space
                • Accelerate performance for arbitrary applications
    • 9. RCFile
        • Pros
            • Works with Column Pruning
                • Only touch needed columns at runtime
                • Lazy decompression
                    • SELECT col1, col2 FROM tbl_col_10 WHERE col1 > 30
                    • Will only touch col1 and col2
                    • col2 is decompressed only when a block contains a col1 value greater than 30
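The column pruning and lazy decompression described on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the RCFile on-disk format or reader: the `RowGroup` class, the JSON-plus-zlib encoding, and the `decompressed` log are all hypothetical and exist only to show which columns get decompressed.

```python
import json
import zlib


class RowGroup:
    """Toy row group: each column is compressed and stored separately."""

    def __init__(self, columns):
        self._blobs = {
            name: zlib.compress(json.dumps(values).encode("utf-8"))
            for name, values in columns.items()
        }
        self.decompressed = []  # log of columns actually decompressed

    def column(self, name):
        self.decompressed.append(name)
        return json.loads(zlib.decompress(self._blobs[name]).decode("utf-8"))


def select_col1_col2(row_groups, threshold=30):
    """SELECT col1, col2 ... WHERE col1 > threshold, decompressing lazily."""
    rows = []
    for group in row_groups:
        col1 = group.column("col1")  # always needed to evaluate the filter
        matches = [i for i, value in enumerate(col1) if value > threshold]
        if not matches:
            continue  # no qualifying row: col2 stays compressed
        col2 = group.column("col2")  # decompressed only when needed
        rows.extend((col1[i], col2[i]) for i in matches)
    return rows


groups = [
    RowGroup({"col1": [1, 5, 9], "col2": ["a", "b", "c"], "col3": [0, 0, 0]}),
    RowGroup({"col1": [10, 40, 50], "col2": ["d", "e", "f"], "col3": [0, 0, 0]}),
]
rows = select_col1_col2(groups)
print(rows)
print([group.decompressed for group in groups])
```

In this sketch `col3` is never touched (column pruning), and the first group never decompresses `col2` because none of its `col1` values exceed 30 (lazy decompression).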
    • 10. RCFile
        • Cons
            • Row Construction
                • Is the main overhead
                • Each column's data is stored separately, and may be sorted in a different order
                • In-memory operation for RCFile
                • This could be really painful; a lot of room to improve here
    • 11. RCFile
        • Facebook Deployment
            • Default file format in Facebook cluster
            • 20% space savings on average
            • We are transforming old data to the new format
    • 12. RCFile
        • Future work
            • Support built-in indexing
                • E.g., Bloom filters
            • More cache-conscious columnar operators
            • Push predicates down to the file reader
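For readers unfamiliar with the Bloom filter mentioned as a possible built-in index, here is a minimal generic sketch in Python. It does not describe anything implemented in RCFile; the class, its parameters (`num_bits`, `num_hashes`), and the SHA-256-based hashing are assumptions chosen for brevity. The idea is that a reader could consult such a filter per block and skip blocks that definitely do not contain a value.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a single Python int

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits |= 1 << position

    def might_contain(self, item):
        return all((self.bits >> position) & 1 for position in self._positions(item))


# Hypothetical per-block index over one column's values.
block_index = BloomFilter()
for value in [7, 42, 1001]:
    block_index.add(value)

# True means "possibly present"; False means "definitely absent".
print(block_index.might_contain(42))
```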
    • 13. Questions?
        • [email_address]
        • [email_address]
