Hive Integration: HBase and RCFile (Hadoop Summit 2010)

Hadoop Summit 2010 - Developers Track
Hive Integration: HBase and RCFile
John Sichi and Yongqiang He, Facebook

Transcript

  • 1. Hive Integration: HBase and RCFile
    • John Sichi and Yongqiang He
    Facebook
  • 2. Session Agenda
    • HBase Integration (John Sichi)
    • RCFile Integration (Yongqiang He)
  • 3. HBase: Facebook Warehouse Use Case
    • Reduce latency on dimension data availability
    • [Diagram labels: HBase (dimension data); partitioned RCFiles (fact data); periodic load; continuous update; Hive queries]
  • 4. HBase: Storage Handler
    • CREATE TABLE users(
        userid int, name string, email string, notes string)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = "small:name,small:email,large:notes")
      TBLPROPERTIES (
        "hbase.table.name" = "user_list"
      );
    • Supports INSERT, SELECT, JOIN, GROUP BY, UNION, etc.
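
The column mapping on this slide can be read as a positional pairing between Hive columns and HBase `family:qualifier` entries. The Python sketch below illustrates that reading only; it is not the storage handler's code. `parse_hbase_mapping` is a hypothetical helper, and treating the first Hive column (`userid`) as the HBase row key is an assumption inferred from the slide, which lists four Hive columns but only three mapping entries. Later Hive documentation describes an explicit `:key` entry for the row key, so the implicit-key behavior may not hold for other versions.

```python
def parse_hbase_mapping(hive_columns, mapping, implicit_key=True):
    """Pair Hive column names with HBase (family, qualifier) entries by position."""
    entries = [entry.strip() for entry in mapping.split(",")]
    columns = list(hive_columns)
    result = {}
    if implicit_key:
        # Assumption: the first Hive column is the row key and has no mapping entry.
        result[columns[0]] = ("<row key>", None)
        columns = columns[1:]
    if len(columns) != len(entries):
        raise ValueError("number of mapping entries does not match the Hive columns")
    for column, entry in zip(columns, entries):
        family, _, qualifier = entry.partition(":")
        result[column] = (family, qualifier)
    return result


mapping = parse_hbase_mapping(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
for hive_column, target in mapping.items():
    print(hive_column, "->", target)
```

Under this reading, `name` and `email` share the `small` column family while `notes` sits in a separate `large` family.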
  • 5. HBase: Integration Status
    • Testing at scale
      • 20-node test cluster
      • Bulk-loaded 6TB of gzip-compressed data from Hive into HBase in about 30 hours
      • Incremental-loaded from Hive into HBase at 30GB/hr (with write-ahead logging disabled)
      • Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)
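
To put the two load figures on the same scale, the bulk-load number can be converted to an hourly rate. This is back-of-the-envelope arithmetic only, assuming decimal units (1 TB = 1000 GB). The 6TB figure is compressed size, and the slide does not say whether the 30GB/hr figure is compressed, so the ratio is a rough indication rather than a like-for-like comparison.

```python
TB_IN_GB = 1000  # assumption: decimal units

bulk_gb_per_hr = 6 * TB_IN_GB / 30   # 6TB in about 30 hours
incremental_gb_per_hr = 30           # rate stated on the slide
ratio = bulk_gb_per_hr / incremental_gb_per_hr

print(f"bulk load: about {bulk_gb_per_hr:.0f} GB/hr")
print(f"bulk vs. incremental: about {ratio:.1f}x")
```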
  • 6. HBase: Integration Roadmap
    • Retest against HBase trunk with larger (30TB) data
    • Try out new features for accelerating incremental load
      • Bulk load into table with existing data
      • Multiputs
      • Deferred logging
    • Support for "virtual partitions" based on timestamps
    • Support for deletion
    • Push down filters
    • Index join? Optimize scans?
  • 7. RCFile
    • Why Columnar Storage
      • Better compression
        • Lightweight compression
        • RLE
        • Bitmap
        • etc.
      • CPU, Memory, Storage
      • Columnar Operator
        • Cache conscious (MonetDB)
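
Run-length encoding (RLE), one of the lightweight schemes listed above, benefits from columnar layout because values from a single column tend to repeat in runs. The sketch below is a generic Python illustration of RLE on one column; it makes no claim about the codec RCFile actually uses, and the sample column values are invented for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


column = ["US", "US", "US", "CA", "CA", "US"]  # hypothetical column values
runs = rle_encode(column)
print(runs)
assert rle_decode(runs) == column
```

Stored row by row, the same values would be interleaved with other columns and the runs would be broken up.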
  • 8. RCFile
    • Why RCFile
      • Huge data
      • Reduce data storage space required
      • Ad-hoc workloads
      • Storage space vs. speed (data performance)
      • Can we get both with no application changes?
        • Reduce storage space
        • Accelerate performance for arbitrary applications
  • 9. RCFile
    • Pros
      • Works with column pruning
        • Only touch needed columns at runtime
        • Lazy decompression
          • SELECT col1, col2 FROM tbl_col_10 WHERE col1 > 30
          • Will only touch col1 and col2
          • col2 is decompressed only when a block contains a col1 value greater than 30
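
The lazy decompression described on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not RCFile's reader: a block is modeled as a dict of separately compressed columns, `zlib` plus JSON stand in for the real encoding, and the helper names (`make_block`, `scan`) and sample rows are hypothetical.

```python
import json
import zlib


def compress_column(values):
    return zlib.compress(json.dumps(values).encode("utf-8"))


def decompress_column(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def make_block(rows, column_names):
    """Store one block of rows column by column, compressing each column separately."""
    return {
        name: compress_column([row[i] for row in rows])
        for i, name in enumerate(column_names)
    }


def scan(blocks, stats):
    """Evaluate SELECT col1, col2 ... WHERE col1 > 30, decompressing col2 lazily."""
    out = []
    for block in blocks:
        col1 = decompress_column(block["col1"])  # always needed for the predicate
        stats["col1"] += 1
        if not any(value > 30 for value in col1):
            continue  # no match in this block: col2 stays compressed
        col2 = decompress_column(block["col2"])
        stats["col2"] += 1
        out.extend((a, b) for a, b in zip(col1, col2) if a > 30)
    return out


names = ["col1", "col2", "col3"]
blocks = [
    make_block([(10, "a", "x"), (20, "b", "y")], names),
    make_block([(40, "c", "z"), (5, "d", "w")], names),
]
stats = {"col1": 0, "col2": 0}
result = scan(blocks, stats)
print(result, stats)
```

In this example col3 is never read (column pruning), and col2 is decompressed for only one of the two blocks.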
  • 10. RCFile
    • Cons
      • Row construction
        • Is the main overhead
        • Each column's data is stored separately, and may be sorted in different order
        • In-memory operation for RCFile
        • This could be really painful; a lot of room to improve here
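
Row construction, the overhead named on this slide, means stitching separately stored column values back into row tuples. The Python sketch below shows the simplest case and assumes every column in a block keeps the same row order; the slide notes that this may not always hold, in which case a reordering step would be needed first. `construct_rows` and the sample data are hypothetical.

```python
def construct_rows(columns, wanted):
    """Rebuild row tuples from column arrays that share the same row order."""
    arrays = [columns[name] for name in wanted]
    if len({len(array) for array in arrays}) > 1:
        raise ValueError("columns in a block must have the same number of values")
    return list(zip(*arrays))


block_columns = {
    "col1": [40, 5],
    "col2": ["c", "d"],
    "col3": ["z", "w"],
}
rows = construct_rows(block_columns, ["col1", "col2"])
print(rows)
```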
  • 11. RCFile
    • Facebook Deployment
      • Default file format in Facebook cluster
      • 20% space savings on average
      • We are transforming old data to the new format
  • 12. RCFile
    • Future work
      • Support built-in indexing
        • e.g., Bloom filters
      • More cache-conscious columnar operators
      • Push predicates down to the file reader
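
As an illustration of the Bloom filter idea mentioned under future work, a reader could keep one small filter per block and skip blocks that cannot contain a looked-up value. The Python sketch below is a generic Bloom filter, not a planned RCFile design; the class, its parameters (`num_bits`, `num_hashes`), and the use of SHA-256 for hashing are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


block_filter = BloomFilter()
for user_id in [101, 205, 333]:  # hypothetical values stored in one block
    block_filter.add(user_id)

print(block_filter.might_contain(205))
```

A negative answer lets the reader skip the block without decompressing it; a positive answer may be a false positive, so the block still has to be checked.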
  • 13. Questions?
    • [email_address]
    • [email_address]