Hive integration: HBase and Rcfile__HadoopSummit2010
Hadoop Summit 2010 - Developers Track
Hive integration: HBase and Rcfile
John Sichi and Yongqiang He, Facebook

Published in: Technology
  • Transcript

    • 1. Hive Integration: HBase and RCFile
      • John Sichi and Yongqiang He
      Facebook
    • 2. Session Agenda
      • HBase Integration (John Sichi)
      • RCFile Integration (Yongqiang He)
    • 3. HBase: Facebook Warehouse Use Case
      • Reduce latency on dimension data availability
      [Diagram: HBase (dimension data); partitioned RCFiles (fact data); periodic load; continuous update; Hive queries]
    • 4. HBase: Storage Handler
      • CREATE TABLE users(
      • userid int, name string, email string, notes string)
      • STORED BY
      • 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      • WITH SERDEPROPERTIES (
      • "hbase.columns.mapping" =
      • "small:name,small:email,large:notes")
      • TBLPROPERTIES (
      • "hbase.table.name" = "user_list"
      • );
      • INSERT, SELECT, JOIN, GROUP BY, UNION etc
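The column mapping on this slide can be read as a list of HBase "family:qualifier" entries matched positionally to Hive columns. The following is a minimal Python sketch of that reading, not the storage handler's actual code. It assumes, as the four Hive columns against three mapping entries suggest, that the first Hive column (userid) serves as the HBase row key; the helper name map_columns is hypothetical.

```python
# Hypothetical helper: pair Hive columns with HBase "family:qualifier" entries.
hive_columns = ["userid", "name", "email", "notes"]
columns_mapping = "small:name,small:email,large:notes"


def map_columns(hive_cols, mapping):
    """Return (row_key_column, {hive_column: (family, qualifier)})."""
    entries = [entry.strip() for entry in mapping.split(",")]
    # Assumption: the first Hive column is the row key and has no mapping entry.
    key_col, value_cols = hive_cols[0], hive_cols[1:]
    if len(entries) != len(value_cols):
        raise ValueError("mapping needs one entry per non-key column")
    pairs = {}
    for col, entry in zip(value_cols, entries):
        family, qualifier = entry.split(":", 1)
        pairs[col] = (family, qualifier)
    return key_col, pairs


row_key, column_map = map_columns(hive_columns, columns_mapping)
print(row_key)
for col, (family, qualifier) in column_map.items():
    print(f"{col} -> family={family}, qualifier={qualifier}")
```

Under this reading, name and email share the "small" column family while notes sits alone in "large".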
    • 5. HBase: Integration Status
      • Testing at scale
        • 20-node test cluster
        • Bulk-loaded 6TB of gzip-compressed data from Hive into HBase in about 30 hours
        • Incrementally loaded from Hive into HBase at 30GB/hr (with write-ahead logging disabled)
        • Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)
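To put the two load figures on the same scale, here is a rough back-of-the-envelope calculation in Python. It assumes decimal units (1 TB = 1000 GB) and that both figures count the same kind of bytes; the slide does not say whether the 30GB/hr rate refers to compressed data, so the ratio is only indicative.

```python
# Figures quoted on the slide.
bulk_tb = 6                  # gzip-compressed data bulk-loaded
bulk_hours = 30              # approximate wall-clock time
incremental_gb_per_hr = 30   # incremental load rate (WAL disabled)
nodes = 20                   # size of the test cluster

GB_PER_TB = 1000             # assumption: decimal units

bulk_gb_per_hr = bulk_tb * GB_PER_TB / bulk_hours
ratio = bulk_gb_per_hr / incremental_gb_per_hr
bulk_gb_per_hr_per_node = bulk_gb_per_hr / nodes

print(f"bulk load: {bulk_gb_per_hr:.0f} GB/hr")
print(f"bulk vs. incremental: {ratio:.1f}x")
print(f"bulk load per node: {bulk_gb_per_hr_per_node:.0f} GB/hr")
```

Under these assumptions the bulk path moves about 200 GB/hr across the cluster, roughly 6.7 times the incremental rate.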
    • 6. HBase: Integration Roadmap
      • Retest against HBase trunk with larger (30TB) data
      • Try out new features for accelerating incremental load
        • Bulk load into table with existing data
        • Multiputs
        • Deferred logging
      • Support for "virtual partitions" based on timestamps
      • Support for deletion
      • Push down filters
      • Index join? Optimize scans?
    • 7. RCFile
      • Why Columnar Storage
        • Better compression
          • Lightweight compression
          • RLE
          • Bitmap
          • Etc.
        • CPU, memory, storage
        • Columnar operators
          • Cache-conscious (MonetDB)
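Run-length encoding (RLE), one of the lightweight schemes listed above, works well on columnar data because values of one column stored together often repeat. The sketch below is a generic illustration in Python, not RCFile's on-disk encoding; the function names and sample column are made up for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


# A hypothetical low-cardinality column, as it would sit in column-wise storage.
country_column = ["US", "US", "US", "UK", "UK", "US"]
runs = rle_encode(country_column)
print(runs)  # [('US', 3), ('UK', 2), ('US', 1)]
print(rle_decode(runs) == country_column)  # True
```

The same six values interleaved with other fields in row-wise storage would offer no runs to collapse.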
    • 8. RCFile
      • Why RCFile
        • Huge data
        • Reduce data storage space required
        • Ad-hoc workloads
        • Storage space vs. speed (data performance)
        • Can we get both with no application changes?
          • Reduce storage space
          • Accelerate performance for arbitrary applications
    • 9. RCFile
      • Pros
        • Works with column pruning
          • Only touch needed columns at runtime
          • Lazy decompression
            • SELECT col1, col2 FROM tbl_col_10 WHERE col1 > 30
            • Will only touch col1 and col2
            • col2 is decompressed only when a block contains a col1 value greater than 30
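The lazy decompression example above can be sketched as follows. This is a toy model in Python, not RCFile's reader: each "row group" stores its columns as separately compressed blobs (zlib over JSON, chosen only for convenience), and the class and function names are hypothetical.

```python
import json
import zlib


def pack(values):
    """Compress one column's values into a standalone blob."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


def unpack(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))


class RowGroup:
    """Toy block whose columns are compressed independently."""

    def __init__(self, columns):
        self.blobs = {name: pack(values) for name, values in columns.items()}
        self.decompressed = []  # records which columns were actually touched

    def read(self, name):
        self.decompressed.append(name)
        return unpack(self.blobs[name])


def select_col1_col2(group, threshold=30):
    """SELECT col1, col2 ... WHERE col1 > threshold, for one row group."""
    col1 = group.read("col1")
    hits = [i for i, value in enumerate(col1) if value > threshold]
    if not hits:
        return []  # no match in this block: col2 stays compressed
    col2 = group.read("col2")
    return [(col1[i], col2[i]) for i in hits]


group_a = RowGroup({"col1": [1, 5, 9], "col2": ["a", "b", "c"], "col3": [0, 0, 0]})
group_b = RowGroup({"col1": [10, 40, 50], "col2": ["x", "y", "z"], "col3": [1, 1, 1]})

rows_a = select_col1_col2(group_a)
rows_b = select_col1_col2(group_b)
print(rows_a, group_a.decompressed)  # [] ['col1']
print(rows_b, group_b.decompressed)  # [(40, 'y'), (50, 'z')] ['col1', 'col2']
```

col3 is never decompressed in either group (column pruning), and col2 is decompressed only in the group where some col1 value passes the predicate.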
    • 10. RCFile
      • Cons
        • Row construction
          • Is the main overhead
          • Each column's data is stored separately, and may be sorted in a different order
          • In-memory operation for RCFile
          • This could be really painful; a lot of room to improve here
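Row construction means stitching separately stored column arrays back into tuples. The Python sketch below shows the simplest case, where all columns of a block are already in memory and in the same row order, which matches the "in-memory operation" noted above. It is an illustration only; build_rows and the sample block are hypothetical.

```python
def build_rows(columns, wanted):
    """Rebuild row tuples from per-column arrays of one block.

    Assumption: every column holds its values in the same row order,
    so position i in each array belongs to row i.
    """
    arrays = [columns[name] for name in wanted]
    if len({len(array) for array in arrays}) > 1:
        raise ValueError("columns in one block must have equal length")
    return list(zip(*arrays))


block = {
    "userid": [1, 2, 3],
    "name": ["ann", "bob", "cy"],
    "notes": ["n1", "n2", "n3"],
}
rows = build_rows(block, ["userid", "name"])
print(rows)  # [(1, 'ann'), (2, 'bob'), (3, 'cy')]
```

Even in this easy case every output row costs one lookup per requested column, which is the per-row overhead a row-oriented format avoids.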
    • 11. RCFile
      • Facebook Deployment
        • Default file format in Facebook cluster
        • 20% space savings on average
        • We are transforming old data to the new format
    • 12. RCFile
      • Future work
        • Support built-in indexing
          • Like Bloom filters, etc.
        • More cache-conscious columnar operators
        • Push predicates down to the file reader
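As an illustration of how a Bloom filter could serve as a built-in index, the Python sketch below keeps one small filter per block so that a reader can skip blocks that certainly do not contain a looked-up value. This is a generic Bloom filter, not a design taken from the talk; the sizes, hash construction, and names are arbitrary assumptions.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# One filter per block, built when the block is written.
block_values = [3, 17, 42]
block_filter = BloomFilter()
for value in block_values:
    block_filter.add(value)

# At read time: a False answer means the block can be skipped without
# decompressing it; a True answer means the block must still be scanned.
print(block_filter.might_contain(42))  # True
print(block_filter.might_contain(99))
```

A value that was added always reports True; a value that was never added usually reports False, and the rare false positive costs only an unnecessary scan.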
    • 13. Questions?
      • [email_address]
      • [email_address]
