Hadoop, HBase, and Hive - Bay Area Hadoop User Group

Transcript

  • 1. Hive/HBase Integration (or, MaybeSQL?) April 2010, John Sichi, Facebook
  • 2. Agenda
    • Use Cases
    • Architecture
    • Storage Handler
    • Load via INSERT
    • Query Processing
    • Bulk Load
    • Q & A
  • 3. Motivations
    • Data, data, and more data
      • 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
      • About 8x increase per year
    • Queries, queries, and more queries
      • More than 200 unique users querying per day
      • 7500+ queries on production cluster per day; mixture of ad-hoc queries and ETL/reporting queries
    • They want it all and they want it now
      • Users expect faster response time on fresher data
      • Sampled subsets aren’t always good enough
  • 4. How Can HBase Help?
    • Replicate dimension tables from transactional databases with low latency and without sharding
      • (Fact data can stay in Hive since it is append-only)
    • Only move changed rows
      • “Full scrape” is too slow and doesn’t scale as data keeps growing
      • Hive by itself is not good at row-level operations
    • Integrate into Hive’s map/reduce query execution plans for full parallel distributed processing
    • Multiversioning for snapshot consistency?
  • 5. Use Case 1: HBase As ETL Data Target [diagram: source files/tables -> Hive INSERT … SELECT … -> HBase]
  • 6. Use Case 2: HBase As Data Source [diagram: HBase plus other files/tables -> Hive SELECT … JOIN … GROUP BY … -> query result]
  • 7. Use Case 3: Low Latency Warehouse [diagram: continuous updates flow into HBase, periodic loads into other files/tables; Hive queries read both]
  • 8. HBase Architecture [diagram, from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]
  • 9. Hive Architecture [diagram]
  • 10. All Together Now! [diagram]
  • 11. Hive CLI With HBase
    • Minimum configuration needed:
      hive --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
           -hiveconf hbase.zookeeper.quorum=zk1,zk2…
      hive> create table …
  • 12. Storage Handler
      CREATE TABLE users(
        userid int, name string, email string, notes string)
      STORED BY
        'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = "small:name,small:email,large:notes")
      TBLPROPERTIES (
        "hbase.table.name" = "user_list");
  • 13. Column Mapping
    • First column in table is always the row key
    • Other columns can be mapped to either:
      • An HBase column (any Hive type)
      • An HBase column family (must be MAP type in Hive; see the sketch after this slide)
    • Multiple Hive columns can map to the same HBase column or family
    • Limitations
      • Currently no control over type mapping (always string in HBase)
      • Currently no way to map HBase timestamp attribute
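
    For concreteness, a minimal sketch of the family-to-MAP mapping described above (the user_props table and the "p" family are hypothetical; per the slide, the first column is implicitly the row key):

      CREATE TABLE user_props(userid int, props map<string, string>)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        -- a bare "family:" maps the whole column family: each HBase
        -- qualifier becomes a map key, each cell value a map value
        "hbase.columns.mapping" = "p:");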
  • 14. Load Via INSERT
      INSERT OVERWRITE TABLE users
      SELECT * FROM …;
    • Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
    • HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
    • Multiple rows with same key -> only one row written
    • Limitations
      • No write atomicity yet
      • No way to delete rows
      • Write parallelism is query-dependent (map vs reduce; see the sketch after this slide)
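
    A sketch of the parallelism caveat above (staging_users is a hypothetical source table); whether rows are written to HBase from mappers or reducers falls out of the query plan:

      -- Map-only plan: no shuffle needed, so every mapper writes its
      -- split's rows straight to HBase in parallel.
      INSERT OVERWRITE TABLE users
      SELECT userid, name, email, notes FROM staging_users;

      -- Map/reduce plan: the GROUP BY forces a reduce phase, so the
      -- HBase writes happen in the reducers instead.
      INSERT OVERWRITE TABLE users
      SELECT userid, max(name), max(email), max(notes)
      FROM staging_users
      GROUP BY userid;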
  • 15. Map-Reduce Job for INSERT [diagram, from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png]
  • 16. Map-Only Job for INSERT [diagram]
  • 17. Query Processing
    • SELECT name, notes FROM users WHERE userid='xyz';
    • Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
    • HBase determines the splits (one per table region)
    • HBaseSerDe produces lazy rows/maps for RowResults
    • Column selection is pushed down
    • Any SQL can be used (join, aggregation, union…); see the sketch after this slide
    • Limitations
      • Currently no filter pushdown
      • How do we achieve locality?
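
    A sketch of using the HBase-backed table in ordinary Hive SQL (page_views is a hypothetical Hive fact table):

      -- Join the HBase-backed dimension table against a regular Hive
      -- table and aggregate; Hive plans this as normal map/reduce,
      -- with one map task per HBase region.
      SELECT u.name, count(1) AS views
      FROM users u
      JOIN page_views v ON (u.userid = v.userid)
      GROUP BY u.name;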
  • 18. Metastore Integration
    • DDL can be used to create metadata in Hive and HBase simultaneously and consistently
    • CREATE EXTERNAL TABLE: register an existing HBase table (see the sketch after this slide)
    • DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
    • Limitations
      • No two-phase-commit for DDL operations
      • ALTER TABLE is not yet implemented
      • Partitioning is not yet defined
      • No secondary indexing
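
    A sketch of registering a pre-existing HBase table (users_ext is a hypothetical name; mapping syntax as on slide 12):

      -- EXTERNAL means Hive does not own the underlying HBase table;
      -- DROP TABLE users_ext removes only the Hive metadata.
      CREATE EXTERNAL TABLE users_ext(
        userid int, name string, email string, notes string)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = "small:name,small:email,large:notes")
      TBLPROPERTIES ("hbase.table.name" = "user_list");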
  • 19. Bulk Load
    • Ideally…
      SET hive.hbase.bulk=true;
      INSERT OVERWRITE TABLE users SELECT … ;
    • But for now, you have to do some work and issue multiple Hive commands
      • Sample source data for range partitioning
      • Save sampling results to a file
      • Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
      • Import HFiles into HBase
      • HBase can merge files if necessary
  • 20. Range Partitioning During Sort [diagram: TotalOrderPartitioner routes keys into ranges A-G, H-Q, R-Z using split points (H) and (R); loadtable.rb imports the resulting files into HBase]
  • 21. Sampling Query For Range Partitioning
    • Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges. (Sampling one bucket yields 5000 sorted rows; keeping every 501st row selects rows 501, 1002, …, 4509, i.e. exactly 9 split points.)
      select user_id from
        (select user_id
         from hive_user_table
         tablesample(bucket 1 out of 1000 on user_id) s
         order by user_id) sorted_user_5k_sample
      where (row_sequence() % 501) = 0;
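
    Note that row_sequence() is not a built-in; it ships in hive-contrib and must be registered first (the jar path below is installation-specific):

      -- register the contrib UDF used by the sampling query above
      ADD JAR /path/to/hive_contrib.jar;
      CREATE TEMPORARY FUNCTION row_sequence
        AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';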
  • 22. Sorting Query For Bulk Load
      set mapred.reduce.tasks=12;
      set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
      set total.order.partitioner.path=/tmp/hb_range_key_list;
      set hfile.compression=gz;

      create table hbsort(user_id string, user_type string, ...)
      stored as
        inputformat 'org.apache.hadoop.mapred.TextInputFormat'
        outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
      tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

      insert overwrite table hbsort
      select user_id, user_type, createtime, …
      from hive_user_table
      cluster by user_id;
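
    The sort leaves HFiles under /tmp/hbsort/cf; the final import is the loadtable.rb step from slide 20. A sketch, assuming HBase 0.20's loadtable.rb script (the target table name here is an assumption):

      # import the sorted HFiles as a new HBase table; loadtable.rb
      # takes the target table name and the HFile directory
      hbase org.jruby.Main loadtable.rb user_list /tmp/hbsort/cf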
  • 23. Deployment
    • Latest Hive trunk (will be in Hive 0.6.0)
    • Requires Hadoop 0.20+
    • Tested with HBase 0.20.3 and ZooKeeper 3.2.2
    • 20-node hbtest cluster at Facebook
    • No performance numbers yet
      • Currently setting up tests with about 6TB (gz compressed)
  • 24. Questions?
    • [email_address]
    • [email_address]
    • http://wiki.apache.org/hadoop/Hive/HBaseIntegration
    • http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
    • Special thanks to Samuel Guo for the early versions of the integration code
  • 25. Hey, What About HBQL?
    • HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables; it is not intended for heavy-duty SQL processing such as joins and aggregations
    • HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs