Hadoop, HBase and Hive - Bay Area Hadoop User Group

Usage Rights: CC Attribution-NonCommercial License

    Hadoop, HBase and Hive - Bay Area Hadoop User Group: Presentation Transcript

    • Hive/HBase Integration or, MaybeSQL? (April 2010, John Sichi, Facebook)
    • Agenda
      • Use Cases
      • Architecture
      • Storage Handler
      • Load via INSERT
      • Query Processing
      • Bulk Load
      • Q & A
    • Motivations
      • Data, data, and more data
        • 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
        • About 8x increase per year
      • Queries, queries, and more queries
        • More than 200 unique users querying per day
        • 7500+ queries on production cluster per day; mixture of ad-hoc queries and ETL/reporting queries
      • They want it all and they want it now
        • Users expect faster response time on fresher data
        • Sampled subsets aren’t always good enough
    • How Can HBase Help?
      • Replicate dimension tables from transactional databases with low latency and without sharding
        • (Fact data can stay in Hive since it is append-only)
      • Only move changed rows
        • “Full scrape” is too slow and doesn’t scale as data keeps growing
        • Hive by itself is not good at row-level operations
      • Integrate into Hive’s map/reduce query execution plans for full parallel distributed processing
      • Multiversioning for snapshot consistency?
    • Use Case 1: HBase As ETL Data Target (diagram: source files/tables loaded into HBase via Hive INSERT … SELECT …)
    • Use Case 2: HBase As Data Source (diagram: HBase plus other files/tables queried through Hive with SELECT … JOIN … GROUP BY …, producing a query result)
    • Use Case 3: Low Latency Warehouse (diagram: continuous updates flow into HBase, periodic loads feed other files/tables; Hive queries see both)
    • HBase Architecture (diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
    • Hive Architecture (diagram)
    • All Together Now! (combined Hive + HBase architecture diagram)
    • Hive CLI With HBase
      • Minimum configuration needed:
      • hive --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar -hiveconf hbase.zookeeper.quorum=zk1,zk2…
      • hive> create table …
    • Storage Handler
      • CREATE TABLE users(
      •   userid int, name string, email string, notes string)
      • STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      • WITH SERDEPROPERTIES (
      •   "hbase.columns.mapping" = "small:name,small:email,large:notes")
      • TBLPROPERTIES (
      •   "hbase.table.name" = "user_list"
      • );
    • Column Mapping
      • First column in table is always the row key
      • Other columns can be mapped to either:
        • An HBase column (any Hive type)
        • An HBase column family (must be MAP type in Hive; see the sketch after this list)
      • Multiple Hive columns can map to the same HBase column or family
      • Limitations
        • Currently no control over type mapping (always string in HBase)
        • Currently no way to map HBase timestamp attribute
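      A hedged illustration (not from the original deck) of the family-to-MAP mapping described above; the table, family, and qualifier names are made up:
        -- every qualifier under the HBase family "attr" becomes a key in the Hive MAP
        CREATE TABLE user_attrs(userid int, attrs map<string, string>)
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = "attr:")
        TBLPROPERTIES ("hbase.table.name" = "user_attrs");
      As the limitation above notes, all values come back as strings in this version.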
    • Load Via INSERT
      • INSERT OVERWRITE TABLE users SELECT … FROM …; (sketch after this list)
      • Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
      • HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
      • Multiple rows with same key -> only one row written
      • Limitations
        • No write atomicity yet
        • No way to delete rows
        • Write parallelism is query-dependent (map vs reduce)
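      A hedged sketch (not in the original deck) of loading the users table defined earlier; staging_users is a hypothetical Hive source table:
        -- rows are written to HBase through TableOutputFormat; duplicate keys collapse to one row
        INSERT OVERWRITE TABLE users
        SELECT userid, name, email, notes
        FROM staging_users;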
    • Map-Reduce Job for INSERT (diagram, from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png)
    • Map-Only Job for INSERT (diagram)
    • Query Processing
      • SELECT name, notes FROM users WHERE userid='xyz';
      • Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
      • HBase determines the splits (one per table region)
      • HBaseSerDe produces lazy rows/maps for RowResults
      • Column selection is pushed down
      • Any SQL can be used (join, aggregation, union…); see the example after this list
      • Limitations
        • Currently no filter pushdown
        • How do we achieve locality?
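      For illustration (not from the original slides), a join between the HBase-backed users table and an ordinary Hive table; page_views is a hypothetical fact table:
        -- HBase supplies the dimension rows; Hive runs the join and aggregation as map/reduce
        SELECT u.name, COUNT(*) AS views
        FROM page_views v JOIN users u ON (v.userid = u.userid)
        GROUP BY u.name;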
    • Metastore Integration
      • DDL can be used to create metadata in Hive and HBase simultaneously and consistently
      • CREATE EXTERNAL TABLE: register an existing HBase table (example after this list)
      • DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
      • Limitations
        • No two-phase-commit for DDL operations
        • ALTER TABLE is not yet implemented
        • Partitioning is not yet defined
        • No secondary indexing
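      A hedged example (not in the original deck) of registering a pre-existing HBase table; the table, family, and column names are hypothetical:
        -- EXTERNAL: dropping this table later removes only the Hive metadata, not the HBase table
        CREATE EXTERNAL TABLE hbase_events(event_id string, payload map<string, string>)
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = "d:")
        TBLPROPERTIES ("hbase.table.name" = "events");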
    • Bulk Load
      • Ideally…
      • SET hive.hbase.bulk=true;
      • But for now, you have to do some work and issue multiple Hive commands
        • Sample source data for range partitioning
        • Save sampling results to a file
        • Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
        • Import HFiles into HBase
        • HBase can merge files if necessary
    • Range Partitioning During Sort (diagram: TotalOrderPartitioner splits the sorted key space into ranges such as A-G, H-Q, R-Z, and the resulting files are imported into HBase with loadtable.rb)
    • Sampling Query For Range Partitioning
      • Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.
      • select user_id from
      • (select user_id
      • from hive_user_table
      • tablesample(bucket 1 out of 1000 on user_id) s
      • order by user_id) sorted_user_5k_sample
      • where (row_sequence() % 501)=0;
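      Note: row_sequence() is not a built-in; a hedged sketch of registering it from the Hive contrib jar before running the query above (the jar path is a placeholder):
        ADD JAR /path/to/hive-contrib.jar;
        CREATE TEMPORARY FUNCTION row_sequence AS
          'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
        -- the 9 sampled user_ids are then saved to the range-key file referenced
        -- below via total.order.partitioner.path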
    • Sorting Query For Bulk Load
      • set mapred.reduce.tasks=12;
      • set hive.mapred.partitioner=
      • org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
      • set total.order.partitioner.path=/tmp/hb_range_key_list;
      • set hfile.compression=gz;
      • create table hbsort(user_id string, user_type string, ...)
      • stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
      • outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
      • tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');
      • insert overwrite table hbsort
      • select user_id, user_type, createtime, …
      • from hive_user_table
      • cluster by user_id;
    • Deployment
      • Latest Hive trunk (will be in Hive 0.6.0)
      • Requires Hadoop 0.20+
      • Tested with HBase 0.20.3 and ZooKeeper 3.2.2
      • 20-node hbtest cluster at Facebook
      • No performance numbers yet
        • Currently setting up tests with about 6TB (gz compressed)
    • Questions?
      • [email_address]
      • [email_address]
      • http://wiki.apache.org/hadoop/Hive/HBaseIntegration
      • http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
      • Special thanks to Samuel Guo for the early versions of the integration code
    • Hey, What About HBQL?
      • HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
      • HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs