Hive Integration: HBase and RCFile (Hadoop Summit 2010)
Hadoop Summit 2010 - Developers Track

Hive Integration: HBase and RCFile
John Sichi and Yongqiang He, Facebook

Presentation Transcript

  • Hive Integration: HBase and RCFile
    • John Sichi and Yongqiang He, Facebook
  • Session Agenda
    • HBase Integration (John Sichi)
    • RCFile Integration (Yongqiang He)
  • HBase: Facebook Warehouse Use Case
    • Reduce latency on dimension data availability
    • [Diagram: HBase (dimension data, continuous update) and partitioned RCFiles (fact data, periodic load), both read by Hive queries]
  • HBase: Storage Handler
    CREATE TABLE users(
      userid int, name string, email string, notes string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = "small:name,small:email,large:notes")
    TBLPROPERTIES (
      "hbase.table.name" = "user_list"
    );
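
To make the column mapping in the statement above concrete, here is a minimal Python sketch of how the four Hive columns could pair up with HBase column families and qualifiers. It assumes, as the slide implies, that the first Hive column (userid) is the HBase row key and that the mapping string covers only the remaining columns; later Hive releases are understood to require the key to be named explicitly as `:key` in the mapping. The function name `map_hive_columns` is hypothetical and is not part of Hive or HBase.

```python
def map_hive_columns(hive_columns, columns_mapping):
    """Pair Hive columns with HBase (family, qualifier) targets.

    Assumption taken from the slide: the first Hive column is the HBase
    row key, and the mapping string lists targets for the other columns.
    """
    targets = [t.strip() for t in columns_mapping.split(",")]
    key_column, value_columns = hive_columns[0], hive_columns[1:]
    if len(targets) != len(value_columns):
        raise ValueError("mapping must have one entry per non-key column")
    mapping = {}
    for column, target in zip(value_columns, targets):
        family, qualifier = target.split(":", 1)
        mapping[column] = (family, qualifier)
    return key_column, mapping


key_column, mapping = map_hive_columns(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
print(key_column)  # userid
for column, (family, qualifier) in mapping.items():
    print(f"{column} -> family={family}, qualifier={qualifier}")
```

In this reading, the short fields live in one column family (small) and the bulky notes field in another (large), so a scan that only needs names and emails does not have to read the notes data.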
  • HBase: Integration Status
    • Supports INSERT, SELECT, JOIN, GROUP BY, UNION, etc.
    • Testing at scale
      • 20-node test cluster
      • Bulk-loaded 6 TB of gzip-compressed data from Hive into HBase in about 30 hours
      • Incrementally loaded from Hive into HBase at 30 GB/hr (with write-ahead logging disabled)
      • Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)
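
The load figures above can be put on a common footing with a little arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB) and treats the "about 30 hours" as exactly 30; the slide does not say whether the 30 GB/hr incremental figure refers to compressed data, so the comparison is only a rough guide.

```python
GB_PER_TB = 1000  # decimal units: an assumption, the slide does not specify

nodes = 20
bulk_rate_gb_per_hr = 6 * GB_PER_TB / 30   # 6 TB in about 30 hours
incremental_rate_gb_per_hr = 30            # as stated on the slide

bulk_rate_mb_per_s = bulk_rate_gb_per_hr * 1000 / 3600
bulk_rate_per_node = bulk_rate_gb_per_hr / nodes
speedup = bulk_rate_gb_per_hr / incremental_rate_gb_per_hr

print(f"bulk load: {bulk_rate_gb_per_hr:.0f} GB/hr "
      f"({bulk_rate_mb_per_s:.1f} MB/s cluster-wide)")    # 200 GB/hr (55.6 MB/s)
print(f"bulk load per node: {bulk_rate_per_node:.0f} GB/hr")  # 10 GB/hr
print(f"bulk vs. incremental: {speedup:.1f}x")            # 6.7x
```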
  • HBase: Integration Roadmap
    • Retest against HBase trunk with larger (30 TB) data
    • Try out new features for accelerating incremental load
      • Bulk load into table with existing data
      • Multiputs
      • Deferred logging
    • Support for “virtual partitions” based on timestamps
    • Support for deletion
    • Push down filters
    • Index join? Optimize scans?
  • RCFile
    • Why columnar storage
      • Better compression
        • Lightweight compression
        • RLE
        • Bitmap
        • Etc.
      • CPU, memory, storage
      • Columnar operators
        • Cache-conscious (MonetDB)
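
As an illustration of why storing a column's values together helps lightweight schemes such as RLE, here is a minimal run-length encoding sketch in Python. It is a generic textbook RLE, not RCFile's actual on-disk encoding, and the sample column values are made up.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical low-cardinality column, as is typical for dimension-like data.
country = ["US", "US", "US", "IN", "IN", "US"]
runs = rle_encode(country)
print(runs)  # [('US', 3), ('IN', 2), ('US', 1)]
assert rle_decode(runs) == country
```

Values from a single column tend to repeat far more often than the bytes of interleaved rows, which is why a column-wise layout gives such encodings more to work with.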
  • RCFile
    • Why RCFile
      • Huge data
      • Reduce data storage space required
      • Ad hoc workloads
      • Storage space vs. speed (data performance)
      • Can we get both with no application changes?
        • Reduce storage space
        • Accelerate performance for arbitrary applications
  • RCFile
    • Pros
      • Works with column pruning
        • Only touch needed columns at runtime
        • Lazy decompression
          • SELECT col1, col2 FROM tbl_col_10 WHERE col1 > 30
          • Will only touch col1 and col2
          • col2 is decompressed only when a block contains a col1 value greater than 30
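
The column pruning and lazy decompression described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not RCFile's reader: each row group is a dict of independently zlib-compressed columns, values are serialized as JSON, and the helper names (`make_row_group`, `select_col1_col2`) are hypothetical. The `stats` dict counts how often each column is actually decompressed.

```python
import json
import zlib


def make_row_group(rows, column_names):
    """Store one row group column by column, compressing each column separately."""
    return {
        name: zlib.compress(json.dumps([row[i] for row in rows]).encode("utf-8"))
        for i, name in enumerate(column_names)
    }


def decompress(blob, stats, name):
    """Decompress one column and record that it was touched."""
    stats[name] = stats.get(name, 0) + 1
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def select_col1_col2(row_groups, threshold, stats):
    """SELECT col1, col2 ... WHERE col1 > threshold, decompressing col2 lazily."""
    results = []
    for group in row_groups:
        col1 = decompress(group["col1"], stats, "col1")
        matches = [i for i, value in enumerate(col1) if value > threshold]
        if not matches:
            continue  # no qualifying row in this group: col2 stays compressed
        col2 = decompress(group["col2"], stats, "col2")
        results.extend((col1[i], col2[i]) for i in matches)
    return results


names = ["col1", "col2", "col3"]
groups = [
    make_row_group([(10, "a", "x"), (20, "b", "y")], names),
    make_row_group([(40, "c", "z"), (5, "d", "w")], names),
]
stats = {}
rows = select_col1_col2(groups, 30, stats)
print(rows)   # [(40, 'c')]
print(stats)  # {'col1': 2, 'col2': 1}
```

col3 is never decompressed (column pruning), and col2 is decompressed only for the second row group, the one that contains a col1 value greater than 30.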
  • RCFile
    • Cons
      • Row construction
        • Is the main overhead
        • Each column’s data is stored separately, and may be sorted in a different order
        • In-memory operation for RCFile
        • This could be really painful; a lot of room to improve here
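
To show what the row construction step involves, here is a minimal Python sketch that stitches row tuples back together from separately stored columns. It assumes the simple case where every column in a row group holds its values in the same row order, so the i-th value of each column belongs to the i-th row; the function name `build_rows` and the sample data are hypothetical.

```python
def build_rows(columns, wanted):
    """Rebuild row tuples from per-column value lists (the row-construction step)."""
    selected = [columns[name] for name in wanted]
    lengths = {len(values) for values in selected}
    if len(lengths) > 1:
        raise ValueError("columns in a row group must hold the same number of values")
    # One tuple is allocated per row; this per-row work is the overhead
    # that a row-oriented format does not pay.
    return list(zip(*selected))


row_group = {
    "userid": [1, 2, 3],
    "name": ["ann", "bob", "cy"],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
}
rows = build_rows(row_group, ["userid", "name"])
print(rows)  # [(1, 'ann'), (2, 'bob'), (3, 'cy')]
```

Because the work grows with both the number of rows and the number of selected columns, queries that touch many columns of a wide table benefit least from the columnar layout.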
  • RCFile
    • Facebook deployment
      • Default file format in the Facebook cluster
      • 20% space savings on average
      • We are transforming old data to the new format
  • RCFile
    • Future work
      • Support built-in indexing
        • E.g., Bloom filters
      • More cache-conscious columnar operators
      • Push predicates down to the file reader
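
The Bloom filter idea listed under future work could look roughly like the Python sketch below: one small filter per row group, consulted before the group is read, so that an equality predicate can skip groups that cannot contain the value. This is a generic illustration, not a description of anything RCFile implemented; the class, its sizes (256 bits, 3 hashes), and the sample data are all assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a Python int

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def groups_to_scan(group_filters, value):
    """Indexes of row groups that may contain value; the rest can be skipped."""
    return [i for i, bf in enumerate(group_filters) if bf.might_contain(value)]


# Hypothetical userid values in two row groups.
row_groups = [[101, 102, 103], [204, 205, 206]]
filters = []
for values in row_groups:
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters.append(bf)

candidates = groups_to_scan(filters, 205)
print(candidates)  # always includes 1; group 0 is skipped unless a false positive occurs
```

A filter like this only helps once the predicate reaches the file reader, which is why it pairs naturally with the predicate pushdown item on the same slide.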
  • Questions?
    • [email_address]
    • [email_address]